143 comments

  • bravura a day ago

    @nikisweeting ArchiveBox is awesome and we'd really love it to be more awesome. And sustainable!

    I've posted issues and PRs for showstopper issues that took months to get merged in: https://github.com/ArchiveBox/ArchiveBox/issues/991 https://github.com/ArchiveBox/ArchiveBox/pull/1026

    You have an opportunity to let the community lean in on ArchiveBox. I understand it's hard to do everything as a solo dev; we've seen many cases in the community where solo devs get burned out or have personal challenges that take priority.

    It's hard for us users to lean in on ArchiveBox when, after a happy month of archiving, things start to break and you're left maintaining a branch of your own fixes that aren't in main. Meanwhile, your solution of soliciting one-time donations just makes the whole project feel more rickety and fly-by-night. How about thinking bigger?

    We NEED ArchiveBox to be a real thing. Decentralized tooling for archiving is SO IMPORTANT. I care about it and I suspect many people do. I'm posting this so other people who care about it can also comment and chime in and suggest how it can become something we can rely on. Because archiving isn't just about the past, it's about the future.

    Maybe it needs to be a dev org of three committed part-time maintainers, funded by grants from a small foundation that people support with recurring donations? IDK. I'm not an expert at how to make open source resilient. There have been discussions about this in the past, but I think it's worth a serious look, because ArchiveBox is IMPORTANT and I want it to work any month I decide to re-activate my interest in it. I invite people to discuss ways to make this valuable project more sustainable and resilient.

    • nikisweeting a day ago

      Let's chat more. I'm almost ready to raise some seed money, hire a second staff dev or find a cofounder, and I'm looking for people who care deeply about the space.

      It's only been during the last few months that I decided to go all in on the project, so this is still just the first few pages of a new chapter in the project's history.

      (I should also mention that if you're a commercial entity relying on ArchiveBox, you can hire us for dedicated support and uptime guarantees. We have a closed source fork that has a much better test suite and lots of other goodies)

      • nyx 19 hours ago

        It looks like you're doing great work here, thanks a bunch; looking forward to seeing this project develop.

        Selling custom integrations, managed instances, white-glove support with an SLA, and so on seems like a reasonable funding model for a project based on an open-source, self-hostable platform. But I'm a little disheartened to read that you're maintaining a closed fork with "goodies" in it.

        How do you decide which features (better test suite?) end up in the non-libre, payware fork of your software? If someone contributed a feature to the open-source version that already exists in the payware version, would you allow it to be merged or would you refuse the pull request?

        • nikisweeting 18 hours ago

          The idea with the plugin system is that plugins are just git repos containing <pluginname>/__init__.py, and you can add any set of git repo plugins you want to your instance.

          The marketplace will work by showing all git repos tagged with the "archivebox" tag on github.

          My approval is only needed for PRs to the archivebox core engine.
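
          A plugin repo can be as small as a single module. Roughly something like this (illustrative sketch only; the hook name and return fields here are placeholders, not the final plugin API):

            # my_extractor/__init__.py  (hypothetical example plugin)
            __label__ = 'My Custom Extractor'

            def extract(url: str, out_dir: str) -> dict:
                # fetch the URL, write whatever this plugin captures into out_dir,
                # and return some metadata about what was saved
                output_path = f'{out_dir}/my_extractor.txt'
                with open(output_path, 'w') as f:
                    f.write(f'TODO: capture {url} here\n')
                return {'extractor': 'my_extractor', 'url': url, 'output': output_path}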

          More info on free vs paid + reasoning why it's not all open source: https://news.ycombinator.com/item?id=41863539

      • bigiain 18 hours ago

        "I too would like commit access to your promising looking project's git repo and CI/CD pipeline. Thanks, Jia Tan"

      • giancarlostoro 19 hours ago

        Do you guys have a Discord by chance? I have a close friend who is insanely passionate about archiving; he has a personal instance of ArchiveBox and is working on a video downloading project as well. He has used it almost every day and archived thousands of news articles over the years. He's aware of a lot of the nuances.

      • manofmanysmiles 19 hours ago

        I love this project. I "independently" "invented" it in my head the other day, and happy to see it already exists!

        I'd love to see blockchain proof/notary support. The ability to say "content matching this hash existed at this time."

        I'm exceptionally busy now but that being said, I may choose to contribute nonetheless.

        I'd love to connect directly, and will connect to the Zulip instance later.

        If we align on values, I may be able to connect you with some cash. People often call me an "anarchist" or "libertarian", though I'm just me, no labels necessary.

        • nophunphil 14 hours ago

          Can you please explain what you mean by “blockchain proof/notary support”?

          • manofmanysmiles 13 hours ago

            Motivation: Have evidence that some content existed at a particular time. For example, let's say a major website publishes an article, and later they remove it, and there is no record of it ever existing. If I host an ArchiveBox, I can look at it and see "Oh, here is that article. Looks like it was published after all." However, why should you believe that I didn't just make it up?

            If when I initially archived it, I computed a cryptographic hash of the content and posted that on a blockchain, then at a future date I can at least claim "As of block N, approximately corresponding to this time UTC, content that hashes to this hash existed."

            If multiple unrelated parties also make the same claim, it is stronger evidence.
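
            The mechanics are simple. A rough sketch (just the hashing and the claim; how you anchor it depends on which chain/network you use, and the filename here is a placeholder):

              import hashlib, json, time

              # hash an archived snapshot and build the claim you'd anchor on-chain
              with open('snapshot.html', 'rb') as f:
                  digest = hashlib.sha256(f.read()).hexdigest()

              claim = {'sha256': digest, 'claimed_at_utc': int(time.time())}
              print(json.dumps(claim))  # this digest/payload is what gets posted to a blockchain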

            Is this sufficient explanation? I can expand on this more later.

            • jazzyjackson 13 hours ago

              There's no reason to believe that the hashed and timestamped content was hosted at a particular domain, however (unless the content was signed by the author, of course, in which case there's no blockchain necessary). Sure, multiple peers could make some attestation that they saw it at that URL, but then you're back at square one of the reputation problem.

              The Internet Archive as an institution with a reputation that holds up to a judge is actually more valuable than a cryptographic proof that x bytes existed at y time.

              • manofmanysmiles 12 hours ago

                No, definitely not. I have no inherent reason to trust the people working at the Internet Archive over, let's say, a close friend. For me trust is always a human-to-human concept, and no amount of tech or institutions will change that.

                The more people I hear making a claim, the more I'm likely to deem the claim(s) as true. This is even true regarding the claims that cryptographic algorithms have the properties that make them useful in these contexts. I say this as someone who has even taken graduate level classes with Ron Rivest.

                I'm not sure what will happen in a court. I imagine the more people that start making claims using cryptography as part of the supporting evidence, the more likely people will start to trust cryptography as a useful tool for resolving disputes about the veracity of claims.

                So you would not get any value from multiple people making such claims?

                • nikisweeting 10 hours ago

                  I think the best solution is to have multiple people with reputation attest to the encrypted TLS content without being able to see the cleartext of it; that way they can't easily tamper with it.

                  See my comments on TLSNotary stuff below...

                  • manofmanysmiles 9 hours ago

                    Woah, cool, yes, exactly this!

                    I think I read a paper or blog post about this concept a while ago, but never saw it implemented!

  • toomuchtodo a day ago

    https://github.com/ArchiveTeam/grab-site might be helpful. I'm a fan of the ability to create WARC archives from a target, upload the WARC files to object storage (whether that is IA, S3, Backblaze B2, etc.), and then keep them in cold storage or serve them up via HTTPS or a torrent (mutable, preferred). The Internet Archive serves a torrent file for every item they host; one can do the same with WARC archives to enable a distributed archive. CDX indexes can be used for rapidly querying the underlying WARC archives.
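
    Once you have the WARCs, building a minimal index of what's inside is straightforward with warcio, for example (rough sketch, assuming a local example.warc.gz; real CDX indexes also store record offsets and more):

      from warcio.archiveiterator import ArchiveIterator

      # list the response records in a WARC -- the kind of info a CDX index captures
      with open('example.warc.gz', 'rb') as stream:
          for record in ArchiveIterator(stream):
              if record.rec_type == 'response':
                  uri = record.rec_headers.get_header('WARC-Target-URI')
                  date = record.rec_headers.get_header('WARC-Date')
                  print(date, uri)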

    You might support cryptographically signing WARC archives; Wayback is particular about archive provenance and integrity, for example.

    https://www.loc.gov/preservation/digital/formats/fdd/fdd0005... ("CDX Internet Archive Index File")

    https://www.loc.gov/preservation/digital/formats/fdd/fdd0002... ("WARC, Web ARChive file format")

    https://github.com/internetarchive/wayback/tree/master/wayba... ("Wayback CDX Server API - BETA")

    • nikisweeting a day ago

      I recommend Browsertrix for WARC creation, I think they are the best currently available for WARC/WACZ.

      ArchiveBox is also gearing up to support real cryptographic signing of archives using https://tlsnotary.org/ in an upcoming plugin (in a way that actually solves the TLS non-repudiation issue, which traditional "signing a WARC" does not; more info: https://www.ndss-symposium.org/wp-content/uploads/2018/02/nd...)

      • toomuchtodo a day ago

        Keep in mind, what signing methodology you use is a function of who accepts it. If I can confirm "ArchiveTeam ripped this", that is superior to whatever TLSNotary is doing with MPC, blockchain, distributed ledger, whatever (in my use case). Have to trust someone at the end of the day. ArchiveTeam's Warrior doesn't use TLSNotary, for example, and rips entire sites just fine.

        • nikisweeting a day ago

          The idea with TLSNotary is that you can have several universities or central agencies running signing servers, but you don't have to share the cleartext content of your archives with them to get it signed.

          This dramatically changes what is possible with signing because previously to get ArchiveTeam's signature of approval, they would have to see the content themselves to archive it. With TLSNotary they can sign without needing to see the content/access the cookies/etc.

          • viraptor 21 hours ago

            Isn't that already possible with any kind of notary by giving them a sha256 of the content only? Or am I missing some distinction?

            • nikisweeting 20 hours ago

              You can do that but it proves nothing because TLS session keys are symmetric, so the archiver can forge server responses and falsely attest that the server sent them.

              Look up "TLS non repudiation"

              A real solution like TLSNotary involves a neutral, reputable third party that can't see the cleartext attesting to the ciphertext using a ZK proof.

              The neutral third party doing attestation can't see the content so they can't easily tamper with it, and attempts to tamper indiscriminately would be easily detected and ding their reputation.

      • digitaldragon 17 hours ago

        Unfortunately, Browsertrix relies on the Chrome Devtools Protocol, which strips transfer encoding (and possibly transforms the data in other ways). This results in Browsertrix writing noncompliant WARC files, because the spec requires that the original transfer encoding be preserved.

        • ikreymer 12 hours ago

          Unfortunately, there is not much we can do about transfer-encoding, but the data is otherwise exactly as is returned from the browser. Browsertrix uses the browser to create web archives, so users get an accurate representation of what they see in their browser, which is generally what people want from archives.

          We do the best we can with a limited standard that is difficult to modify. Archiving is always lossy, we try to reduce that as much as possible, but there are limits. People create web archives because they care about not losing their stuff online, not because they need an accurate record of transfer-encoding property in an HTTP connection. If storing the transfer-encoding is the most important thing, then yes, there are better tools for that.

          • CorentinB 10 hours ago

            You could use a proxy.

            "Archiving is always lossy" No.

            • ikreymer 9 hours ago

              Every archiving tool out there makes trade-offs about what is archived and how. No one preserves the raw TLS encrypted H3 traffic because that's not useful. When you browse through an archiving MITM proxy, there are different trade-offs: there's an extra HTTP connection involved (that's not stored), a fake MITM cert, and a downgrade of H2/H3 connection to HTTP/1 (some sites serve different content via H2 vs HTTP/1.1, can detect differences, etc...)

              The web is best-effort, and so is archiving the web.

            • nikisweeting 9 hours ago

              You're talking to the guy who built the best proxy recorder in the archiving industry ;) ikreymer created https://pywb.readthedocs.io/en/latest/

              I think he has more context than any of us on the limits of proxy archiving vs browser based archiving.

              But also if you really need perfect packet-level replication, just wireshark it as he said. Why bother with WARCs at all?

    • pzmarzly a day ago

      Can you recommend some tools to manage mutable torrents? I.e. create them, edit them, download them, and keep the downloads up to date.

      BTW I recently tried using IPFS for a mutable public storage bucket and that didn't go well - downloads were very slow compared to torrents, and IPNS update propagation took ages. Perhaps torrents will do the job.

      • nikisweeting a day ago

        My plan is to use a separate control plane for the discovery/announcements of changes, and torrents just for the data transfer. The specifics are still being actively discussed, and it's a few releases away anyway.

      • Apocryphon a day ago

        Man, looks like the first posts about IPFS cropped up on HN a decade ago. I remember seeing Neocities' announcement of support for it. I wonder if that protocol has gotten anywhere since then.

        • jazzyjackson 13 hours ago

          There has been a large effort by the Internet Archive to adopt IPFS through their partnership with Filecoin, but IME the basic problems of the protocol remain - slow egress, slow discovery, someone still has to serve the file over a gateway to normie HTTP users...

    • 0cf8612b2e1e a day ago

        The Internet Archive serves a torrent file for every item they host
      
      I had no idea. I have found the IA serving speed to be pretty terrible. Are the torrents any better? Presumably the only ones seeding the files are IA themselves.

      • toomuchtodo a day ago

        The benefit is not in seeding speed directly from IA, but the potential for distributed access and seeding of the item. Think of it as a filename of a zip file in a flat distributed filesystem, with the ability to cherry-pick the files that make up the item via traditional BitTorrent mechanisms. Anyone can consume each item via torrent, continue to seed, and then also access the underlying data. IA acts as the storage system of last resort (and the metadata index).

      • pabs3 18 hours ago

        The torrents have better speeds because they have WebSeeds for multiple IA servers, so you can download from multiple servers at once.

  • bityard 18 hours ago

    So, after reading through the comments and website, I just realized I used ArchiveBox a month or two ago for a very specific purpose.

    You see, I inherited a boat.

    This boat belonged to my father. He was not materialistic, but he took very good care of the things he cared about, and he cared about this boat. It's an old 18' aluminum fishing/cruising boat built in the early 1960s. It's not particularly valuable as a collectible, but it is fairly rare and has some unique modifications. I spent a lot of time trying to dig up all of the info that I could on it, but this is one of those situations where most of the companies involved have been gone for decades and most everyone who was around when these were made is either dead or not really on the Internet.

    It's a shame that I waited so long to start my research because 10 or 20 years ago, there were quite a few active web forums containing informational/tutorial threads from the proud owners of these old boats. I know because I have seen references to them. Some of the URLs are in archive.org, some are not. But the forums are gone, so a large chunk of knowledge on these boats is too, probably forever.

    I did manage to dig up some interesting articles, pictures, and forum threads and needed a way to save them so that they didn't disappear from the web as well. There is probably an easier way to go about it, but in the end I ran ArchiveBox via Docker and set it to fetching what I could find and then downloaded the resulting pages as self-contained HTML pages.

    • shiroiushi 16 hours ago

      >because 10 or 20 years ago, there were quite a few active web forums containing informational/tutorial threads from the proud owners of these old boats. ... But the forums are gone, so a large chunk of knowledge on these boats is too, probably forever.

      These days, that kind of info would be locked up in a closed Discord chat somewhere, so you can forget about people 20 years from now ever seeing it.

      • stavros 15 hours ago

        Or people today ever discovering it.

      • Magnets 6 hours ago

        Lots of private groups on facebook too

  • nfriedly a day ago

    I've been using an instance of https://readeck.org/ for personal archives of web pages and I really like it, but I might try out ArchiveBox at some point too.

    I also run an instance of ArchiveTeam Warrior which is constantly uploading things to archive.org, and I like the direction ArchiveBox is heading with the distributed/federated archiving on the roadmap, so I may end up setting up an instance like that even if I don't use it for personal content.

    • venusenvy47 a day ago

      I've been using the Single File extension to save self-contained html files of pages I want to keep for posterity. I like it because any browser can open the files it creates. Is it easy to view the archive files from readeck? I haven't looked at fancier alternatives to my existing solution.

      https://addons.mozilla.org/en-US/firefox/addon/single-file/

      • ninalanyon 7 hours ago

        Readeck saves a page as a zip file. It's not hard to open from the command line or file manager, just unzip and launch the index.html in the web browser.

        But it strips out a lot of detail. Zipping it also means that it's hard to deduplicate. I use WebScrapBook and run rdfind to hardlink all the identical files.

      • nikisweeting a day ago

        Singlefile is excellent, Gildas is a great developer. ArchiveBox has had singlefile as one of its extractors built in for years :)

        • gildas 13 hours ago

          Thank you so much Niki :). The P2P sharing is a great idea. I really hope this feature will get things moving in the archiving field.

      • nfriedly a day ago

        I haven't looked at the on-disk format, I just use the browser interface. (It's fairly common for me to save something from my phone that I'll want to review on a computer later.)

        Here's an example of an Amazon "review" I recently archived that has instructions for using a USB tester I have: https://readeck.home.nfriedly.com/@b/tCngVjkSFOrCbwb9DnY2yw

        And, for comparison, here's the original: https://www.amazon.com/gp/customer-reviews/R3EF0QW6MAJ0VP

        It'd be nice if I could edit out the extra junk near the top, but the important bits are all there.

        • ashildr a day ago

          I was about to post a link to the same URL but archived using SingleFile, which looks like the original at Amazon. I didn't, because I realized that I have absolutely no idea what additional information would be hidden in the file. In the worst case any component sent by Amazon and archived into the file may contain PII, even if I am "logged out".

          I'm not saying that SingleFile is bad in any way, I'm using it a lot on multiple devices, but I'm not sure whether sharing archives is a good idea™.

          • nikisweeting 21 hours ago

            100%, this is the challenge of archiving logged in content.

            It becomes un-shareable unless we use fake burner accounts for capture, or have really good sanitizing methods.

            • ashildr 21 hours ago

              Even when I'm logged out I expect at least information on my geographical location to seep into the archive via URLs addressing specific CDN endpoints or similar mechanisms.

              • nikisweeting 21 hours ago

                Yup, this is why the ArchiveBox browser extension sends URLs to a separate server for archiving with an isolated burner profile.

                I should write a full article on the security implications at some point, there aren't many good top-down explanations of why this is a hard problem.

                • ashildr 15 hours ago

                  I know it's a lot of work, but this would be great and it may give readers a deeper understanding of security in general.

                • ninalanyon 7 hours ago

                  How does it save pages that are only available when you are logged in such as social networking pages?

    • ninalanyon 7 hours ago

      I've just tried Readeck and it doesn't save a good quality copy of the pages using the Firefox extension. SingleFile and WebScrapBook do a much better job.

      I prefer WebScrapBook because it saves all the assets as files under the original names in a directory rather than a zip file. This means that I can use other tools such as find, grep, and file managers like Nemo to search the archive without needing to rely on the application that saved the page.

    • nikisweeting a day ago

      I love ArchiveTeam warrior, it's such a good idea! We run several instances ourselves, and it's part of our Good Karma Kit for computers with spare capacity: https://github.com/ArchiveBox/good-karma-kit

      There are a bunch of other alternatives like Readeck listed on our wiki too, we encourage people to check it out!

      https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-...

  • hooverd 6 minutes ago

    Does anyone have recommendations for the hardware side of self-hosting something like that? How do I avoid bit rot?

    Also I see django-ninja! Very cool.

  • 404mm a day ago

    Somewhat similar topic: does anyone have recommendations for a self-hosted website change monitoring system? I've been running Huginn for many years and it works well; however, I have a feeling the project is on its last leg. Also, it's based on text scraping (XPath/CSS/HTML) and RSS, but it struggles with newer JS-based sites.

    • pabs3 11 hours ago

      I recommend urlwatch, you run it from a terminal and send the output wherever you want, such as email via cron.

      https://thp.io/2008/urlwatch/

    • nikisweeting a day ago

      Changedetection.io

      • 404mm 19 hours ago

        Thank you! That looks great!

    • arminiusreturns a day ago

      Why do you feel like Huginn is on its last leg? It's been in my list of things to play with for years now, but I never got around to it...

      • 404mm 21 hours ago

        It looks like it’s being maintained by a single remaining developer. No new features are being added, just some basic maintenance. The product as a whole still works well, so unless you find something better, I do recommend it. I run it in k3s and the image is probably the easiest way of maintaining it.

  • rcarmo 21 hours ago

    This is nice. I'm actually much more excited about the REST API (which will let me do searches and pull information out, I hope) than the plugin ecosystem, since the last thing I need is for another tool to have a half-baked LLM integration -- I prefer to do that myself and have full control.

    Being able to do RAG on my ArchiveBox is something that I have very much wanted to do for over a year now, and it might finally be within reach without my going and hacking at the archived content tree...

    Edit: Just looked at the API schema at https://demo.archivebox.io/api/v1/docs.

    No dedicated search endpoint? This looks like a HUGE missed opportunity. I was hoping to be able to query an FTS index on the SQLite database... Have I missed something?

    • nikisweeting 21 hours ago

      The /cli/list endpoint is the search endpoint you're looking for. It provides FTS but I can make it clearer in the docs, thanks for the tip.

      As for the AI stuff don't worry, none of it is touching core, it's all in an optional community plugin only for those who want it.

      I'm not personally a huge AI person, but I have clients who are already using it and getting massive value from it, so it's worth mentioning. (They're doing some automated QA on thousands of collected captures and feeding the results into spreadsheets.)

      • rcarmo 21 hours ago

        Thanks, I'll have a look.

        My use for this is very different--I want to be able to use a specific subset of my archived pages (which is mostly reference documentation) to "chat" with, providing different LLM prompts depending on subset and fetching plaintext chunks as reference info for the LLM to summarize (and point me back to the archived pages if I need more info).

        • nikisweeting 20 hours ago

          Ok that makes sense, I think archivebox works as the first step in a pipeline there, with some other tool doing the LLM analysis and query stuff.

          • rcarmo 9 hours ago

            Yep. That's what I've built for myself, I just can't really get at the data inside ArchiveBox until I upgrade.

      • sunshine-o 19 hours ago

        I have been using ArchiveBox recently and love it.

        About search, one thing I haven't yet figured out how to do easily is to plug it into my SearXNG instance, as they only seem to support Elasticsearch, Meilisearch or Solr [0]

        So this new plugin architecture will allow for a meilisearch plugin I guess (with relevancy ranking).
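
        Roughly what I'd hope such a plugin could do is push each snapshot's extracted text into a Meilisearch index that SearXNG can then query. Something like this sketch (the index name, fields, and where the text comes from are just placeholders):

          import meilisearch

          # hypothetical: index an archived page's extracted plaintext into Meilisearch
          client = meilisearch.Client('http://127.0.0.1:7700', 'masterKey')  # local dev instance
          index = client.index('archivebox')
          index.add_documents([{
              'id': 'snapshot-1234',                      # placeholder snapshot id
              'url': 'https://example.com/some/article',
              'title': 'Some Article',
              'content': 'extracted plaintext of the page goes here',
          }])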

        - [0] https://docs.searxng.org/dev/engines/offline/search-indexer-...

  • favorited 20 hours ago

    As someone who was archiving a doomed website earlier today using wget, I was reminded that I really need to get ArchiveBox working...

    I used to rely on my Pinboard subscription, but apparently archive exports haven't worked for years, so those days are over.

    • nikisweeting 20 hours ago

      Pocket also doesn't offer archived page exports (or even RSS export). I feel like both are really dropping the ball in this area!

    • VTimofeenko 18 hours ago

      I recently found omnivore.app through HN comments -- works great for sharing a reading list across machines. I am exporting articles through Obsidian, but there is an API option. I don't think it supports outbound RSS, but they have inbound RSS (i.e. Omnivore as an RSS reader) in beta.

    • pronoiac 15 hours ago

      Oh, writing my own Pinboard archive exporter is somewhere on my too-long to-do list. I should find out what would be good for importing into Archivebox. (WARC?)

  • dewey 6 hours ago

    I've tried to get started with ArchiveBox many times, but it was always quite buggy (Not working in Safari, a bit clunky to run,...) but I've noticed a lot of updates in the past months so I'm excited about it moving forward and giving it another shot.

  • orblivion 20 hours ago

    Have you (and I wonder the same about archive.org) considered making a Merkle tree of the data that gets archived? Since data (including photos and videos) are getting easier to fake, it may be nice to have a provable record that at least a certain version of the data existed at a certain time. It would be most useful in case of some sort of oppressive regime down the line that wants to edit history. You'd want to publish the tip somewhere that records the time, and a blockchain seems to make the most sense to me but maybe you don't like blockchains.

    • nikisweeting 20 hours ago

      Yup, already doing that in the betas. That's what I'm referring to as the beginnings of a "content addressable store" in the article.

      In the closed source fork we currently store a merkle tree summary of each dir in a dotfile containing the sha256 and blake3 hash of all entries / subdirs. When a result is "sealed" the summary is generated, and the final salted hash can be submitted to Solana or ETH or some other network to attest to the time of capture and the content. (That part is coming via a plugin later)
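
      In spirit the dir summary works something like this (simplified sha256-only sketch, not the actual closed-source implementation):

        import hashlib
        from pathlib import Path

        # a dir's hash is the hash of its entries' (name, hash) pairs;
        # files are hashed by content, subdirs are hashed recursively
        def entry_hash(path: Path) -> str:
            if path.is_file():
                return hashlib.sha256(path.read_bytes()).hexdigest()
            lines = [f'{child.name}:{entry_hash(child)}' for child in sorted(path.iterdir())]
            return hashlib.sha256('\n'.join(lines).encode()).hexdigest()

        print(entry_hash(Path('archive/1712345678')))  # placeholder snapshot dir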

      • zvr 6 hours ago

        You might be interested in taking a look at SWHID (Software Hash IDentifiers), which defines a way (on its way to become an ISO standard) to reference files and directories with content-based identifiers, like swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505. Yes, it uses Merkle trees for filesystem hierarchy. https://www.swhid.org/specification/v1.1/5.Core_identifiers/

      • orblivion 20 hours ago

        Wow that's great!

    • beefnugs 19 hours ago

      Not just all that nonsense, but it also makes a lot of sense to share just the parts of a website that matter, like a single video, without having to download an entire archive or the rest of the site.

      • nikisweeting 19 hours ago

        $ archivebox add --extractor=media,readability https://...

        We try to make that easy by allowing ppl to select one or more specific ArchiveBox extractors when adding, so you don't have to archive everything every time.

        Makes it more useful for scraping in a pipeline with some other tools.

  • pabs3 18 hours ago

    Unfortunately ArchiveBox uses wget, so it produces non-standard WARC files. Sadly there are lots of things like this in the WARC ecosystem.

    https://wiki.archiveteam.org/index.php/The_WARC_Ecosystem

    • nikisweeting 18 hours ago

      Yes, this is true currently. If you need nice WARCs I recommend Browsertrix by our friends at Webrecorder instead.

      It's on my roadmap to improve this eventually, but currently I'm focused on saving raw files to a filesystem, because it's more accessible to most users and easier to pipe into other tools.

      I encourage people to use ZFS to do deduping and compression at the filesystem layer.

      • TheTechRobo 16 hours ago

        Browsertrix (and Webrecorder tools in general) also violate the standard by modifying response data. It's supposed to be the raw bytes as they are sent over the network (minus TLS).

        The entire WARC ecosystem is kind of a mess.

        • ikreymer 13 hours ago

          This isn't really true, our tools do not just modify response data for no reason!

          Our tools do the best that we can with an old format that is in use by many institutions. The WARC format does not account for H2/H3 data, which is used by most sites nowadays.

          The goal of our (Webrecorder) tools is to preserve interactive web content with as much fidelity as possible and make them accessible/viewable in the browser. That means stripping TLS, H2/H3, sometimes forcing a certain video resolution, etc., while preserving the authenticity and interactivity of the site. It can be a tricky balance.

          If the goal is to preserve 'raw bytes sent over the network' you can use Wireshark / packet capture, but your archive won't necessarily be useful to a human.

          • CorentinB 10 hours ago

            He didn't say you modify the data for no reason, he said you violate the standard. Which is true. You could respect it, but you don't.

            • nikisweeting 9 hours ago

              imo the Webrecorder stuff is truly state of the art, if they're pushing the limits of WARC standards it's for good reason, and I trust their judgement. They pioneered the newer WACZ standard and are really pushing the whole field forward.

  • grinch5751 a day ago

    This looks like a really wonderful set of developments. Already making plans to use an old laptop of mine as an ArchiveBox machine.

  • agnishom 9 hours ago

    This is interesting. I personally use Omnivore + backup to Obsidian for this purpose.

  • joeross 14 hours ago

    I have no programming skill at all, and I don't know a ton about ArchiveBox except that I set it up and ran it for myself for a while, so I'm asking as an innocent, ignorant, and curious geek: is this something that could be adapted to peer-to-peer distribution, or some other means of making it simultaneously as private and local as you want it and as distributed and bulletproof, uptime-wise, as possible?

  • sagz a day ago

    Do y'all support archiving pages that are behind logins? Like using browser cookies?

    • markerz a day ago

      Yes, but there's security concerns where you might accidentally leak your credentials / cookies if you publish your archive to the public.

      https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overv...

      PS. I'm an archivebox user, not a dev or maintainer.

      • nikisweeting a day ago

        Yes, this is correct, with plans to make this easier in the near future via a setup wizard that guides you through creating dedicated credentials for archiving.

  • petertodd 20 hours ago

    You really should add timestamping to ArchiveBox. The easiest way to do that would be via my OpenTimestamps protocol, https://opentimestamps.org It's open source and free to use, and uses Bitcoin for the actual timestamps. Users of it do not need to make Bitcoin transactions themselves as a set of community calendar servers do that for you. You also don't need a Bitcoin node to create an OTS timestamp, and you can validate an OTS timestamp without a Bitcoin node as well by trusting someone else to do that for you.

    The big thing that ArchiveBox can't do, and the Internet Archive can, is attest to the accuracy of the archive. Being at least able to prove that the archive was created in the past, prior to there being a reason to tamper with it, is the best we can realistically do with current cryptography. So it'd be really good if support for timestamping was added.

    IIUC ArchiveBox is written in Python; OTS has a Python library that should work fine for you: https://github.com/opentimestamps/python-opentimestamps

    • nikisweeting 20 hours ago

      We're going to add TLSNotary support for real cryptographic signing, see my comments below :)

      Timestamping is also on my roadmap, definitely as a plugin (and likely paid) as it's more corporate users that really need it. We need to keep some of the really advanced attestation features paid to be able to support the rest of the business.

      • petertodd 15 hours ago

        > We're going to add TLSNotary support for real cryptographic signing, see my comments below :)

        Last I checked TLSNotary requires a trusted third party. I would strongly suggest timestamping TLSNotary evidence, to be able to prove that evidence was created prior to any of these trusted third parties being compromised.

        • nikisweeting 9 hours ago

          Of course, TLSNotary stuff would necessarily come with a whole ecosystem, including some sort of transparency log like certificate transparency logs, DNS record keeping, timestamping, etc.

          But we'll start with the basics and work our way up to completeness.

      • mikae1 20 hours ago

        Thanks for the box!

        Any examples of other possible really advanced features that might go for-pay?

        Is there any chance you will make current free features for-pay? That'd be rather off-putting for me as a home user.

        • nikisweeting 20 hours ago

          No, everything currently free will stay free.

          The paid stuff currently is:

          - per-user permissions & groups

          - audit logging

          - auto CAPTCHA solving

          - burner credential management for FB/Insta/Twitter/etc. w/ auto phone based account verification ability

          - custom JS scripts for expanding comments, hiding pop ups, etc.

          - managed hosting + support

          Some of this stuff ^ is going to become free in upcoming releases; some will stay paid. What I decide to make free is mostly based on abuse potential and legal ramifications: I'd rather have a say in how the risky stuff is used so that it doesn't become a tool weaponized for botting.

          • mikae1 11 hours ago

            Thanks for the clarification and thanks again for the great work!

    • jasonfarnon 20 hours ago

      I always wonder about this when someone gets in hot water based on something on the wayback machine and the person says the archive was tampered with. Can you elaborate on "prove that the archive was created in the past, prior to there being a reason to tamper it"? What exactly does opentimestamps certify?

      • nikisweeting 20 hours ago

        OpenTimestamps alone can not currently prove anything because TLS session keys are symmetric. The client can forge anything and attest to it falsely. Unless you 100% trust the archiver (in which case you can trust their timestamps), you need TLSNotary or another reputable third party in the loop as a bare minimum.

        But more critically: currently the legal standard for evidence is... screenshots. We have a lot of educating work to do before the public understands the value of attestation and signing.

        • petertodd 15 hours ago

          > OpenTimestamps alone can not currently prove anything because TLS session keys are symmetric.

          Timestamps can prove that the data existed prior to there being a known reason to modify it. While that's not as good as direct signing, that's often still enough to be very useful. The statement that OTS "can not currently prove anything" is incorrect.

          A really good example of this is the Hunter Biden email verification. I used OpenTimestamps to prove that the DKIM key that signed the email was in fact used by Google at the time, by providing a Google-signed email that had been timestamped years ago: https://github.com/robertdavidgraham/hunter-dkim/tree/main/o...

          That's convincing evidence, because it's highly implausible that I would have been working to fake Hunter's emails years before they even came up as an election issue.

          • nikisweeting 9 hours ago

            Ok, fair point, they prove that content existed at some point in time, which is useful sometimes. But I don't want people to over-rely on that as "good enough", we can do much better, it's too low a bar for a whole ecosystem of archiving to rely on when we now have a viable solution to fix it (TLSNotary or others).

  • treyd a day ago

    Is this a project that could be developed to support a distributed mirror of archive.org similar to how Anna's Archive works?

  • wongarsu a day ago

    Does this mean it's now possible to write plugins that dismiss cookie popups, solve captchas, scroll web pages etc.?

    • nikisweeting a day ago

      I have a private plugin with puppeteer support for stuff like this, currently charging clients money to use it to fund the open source development. The clients are people who are already legally allowed to evade CAPTCHAS (e.g. governments, NGOs doing research, lawyers collecting evidence, etc.)

      Unfortunately I can't open source the CAPTCHA solving stuff myself, because it opens me up to liability, but if someone wants to contribute a plugin to the ecosystem I can't stop them ;).

      • 0x1ch a day ago

        Legally allowed to evade CAPTCHAs? LOL.

        What world do we live in where evading a captcha is an illegal offense?

        • nikisweeting 21 hours ago

          It doesn't matter whether or not it's actually legal, what matters is that the big platforms will sue you for trying, so you need a big bankroll to stand your ground.

          At the very least they can bar you from accessing their sites as you're violating ToS that you accept upon signup.

    • xiconfjs 11 hours ago

      You mean ArchiveBox still doesn't deal with cookie popups? If so, it's basically not useful for EU-based web sites.

      • nikisweeting 9 hours ago

        It does, you just have to set up a chrome profile that has an extension to hide cookie popups, or use a profile where you've already accepted/closed them and have a session.

        You can archive with any chrome profile with arbitrary extensions enabled, so you can use uBlock, I still Don't care about cookies, Ghostery, etc.

  • chillfox 16 hours ago

    Awesome, I am really looking forward to the new api and plugins.

    I have been running an instance for almost 2 years now that I use for archiving articles that I reference in my notes.

  • newman314 16 hours ago

    @nikisweeting Is abx-dl already available or is it coming? I took a quick dive and didn't see a repo under the org.

    I'm happy to help package this up once it is available.

    • nikisweeting 9 hours ago

      Not currently available, it should be out soon after v0.9 is released.

      Currently `mkdir tmp_data && cd tmp_data; archivebox install; archivebox add ...` is effectively equivalent to what `abx-dl` will do.

  • millvalleydev a day ago

    For devs like us, ArchiveBox or browsertrix-crawler? For scraping entire sites for our own uses, maybe to keep content that's behind paywalls while we have subscriptions, or maybe to feed it to local LLMs to query?

    • nikisweeting 21 hours ago

      For scraping entire sites, Browsertrix is currently more suited, until we add full-depth recursive crawling in v0.9. For feeding to LLMs ArchiveBox MIGHT BE better (imho) because we extract the raw content and you likely don't need the whole WARC.

  • rodolphoarruda 21 hours ago

    > "In an era where fear of public scrutiny is very tangible, people are afraid of archiving things for eternity. As a result, people choose not to archive at all, effectively erasing that history forever."

    Really? I don't get that feeling at all. I use Evernote to archive anything I consider worth keeping. I wonder where such "fear of archiving" comes from.

    • nikisweeting 21 hours ago

      A lot of people are retreating off public free-for-all platforms like Twitter to more siloed spaces like Discord, for many reasons, not just fear of archiving.

      It all has the same effect of making it harder to archive though.

  • A4ET8a8uTh0 18 hours ago

    Those additions are welcome, but if I could request one -- and one that is very consistently requested -- feature:

    - backing up an entire page

    Yes, it is hard. Yes, for non-pure-HTML pages it is extra painful. But that would honestly make ArchiveBox go from nice-to-have to: yes, I have an actual archive I can use when stuff goes down.

    • nikisweeting 17 hours ago

      Do you mean backing up an entire domain? Like example.com/*

      If so that's starting to roll out in v0.8.5rc50, check out the archivebox/crawls/ folder.

      If you mean archiving a single page more thoroughly, what do you find is missing in Archivebox? Are you able to get singlefile/chrome/wget html when archiving?

      • A4ET8a8uTh0 17 hours ago

        Edit: The first option. ( previous stuff removed )

        Lemme check my current version ( edit: 0.7.2 -- ty, I will update and test soon :D)

        • nikisweeting 17 hours ago

          Ah ok. One caveat: it's only available via the 'archivebox shell' / Python API currently, the CLI & web UIs for full depth crawling will come later.

          You can play around with the models and tasks, but I would wait a few weeks for it to stabilize and check again, it's still under heavy active development

          Check archivebox/archivebox:dev periodically

          • A4ET8a8uTh0 16 hours ago

            No worries. I can do that.

            You guys probably hear it all the time, but you are doing the lord's work. If I thought I could be of use in that project, I would be trying to contribute myself (in fact, let me see if there is a way I can participate in a useful manner).

            • nikisweeting 9 hours ago

              Thanks! I love working on archiving so far, and it's been very motivating to see more and more people getting into archiving lately.

  • dark-star 19 hours ago

    Some time ago I installed ArchiveBox on a RaspberryPi 4 running k3s (a lightweight Kubernetes distro).

    I have documented that here: https://darkstar.github.io/2022/02/07/k3s-on-raspberrypi-at-...

    Note that this was a rather old version and some things have probably changed compared to now, so YMMV, but it might still provide a good reference for those who want to try it.

    • nikisweeting 18 hours ago

      Thanks for making that tutorial!

      Happy to report that most of the quirks you cover have been improved:

      - uid 999 is no longer just enforced, you can pass any PUID:GUID now (like Linuxserver.io containers)

      - it now accepts ADMIN_USERNAME + ADMIN_PASSWORD env vars to create an initial admin user on first start without having to exec

      - archivebox/archivebox:latest is 0.7.2 (yearly stable release) and :dev is the 0.8.x pre-release, updated daily. All images are amd64 & arm64 compatible.

      - singlefile and sonic are now included in all images & available on all platforms amd64/arm64

      • dark-star 18 hours ago

        yeah I really need to update that guide. Since I published it I have updated ArchiveBox locally to a newer version but never bothered to update the guide :)

  • Acrobatic_Road a day ago

    The subline mentions "Auto-login", but the article never elaborates on this. Does this mean we will be able to more easily archive non-public websites?

    Also, how do you plan to ensure data authenticity across a distributed archive? For example, if I archive someone's blog, what is stopping me from inserting inflammatory posts that they never wrote, and passing them off as the real deal? Slight update: I see you're using TLS Notary! That's exactly what I would have suggested!

    • nikisweeting 21 hours ago

      Auto-login is currently a service I provide for paying clients, and you can do it in the open source version manually with some extra config.

      Working hard on making it more accessible in the future, and plugins should help!

  • FiniteField 21 hours ago

    Disappointing that a project that should ostensibly care about preserving the open, non-centralised internet takes the time to namedrop and talk about making "compromises" against preserving a well-known, medium-sized clearnet forum legally operated from a US-based LLC. Still-living independent forum sites in this day and age have unrivalled SNR of actual human-to-human communication, there should be no better candidate for archival. It's sad that a self-hosted archival tool has to apologise for any "evil" content it might be used for in the first place. Tape recorders do not require a disclaimer about people saying "hate speech" into them.

    • nikisweeting 21 hours ago

      Sorry which medium sized forum are you referring to?

      I love forums and want them to continue; I'm not sure where you got the idea that I dislike them as a medium. I was just pointing out that public sites in general have started to see some attrition lately for a variety of reasons, and the tooling needs to keep up with new mediums as they appear.

      I also make no apology for the content, in fact ArchiveBox is explicitly designed to archive the most vile stuff for lawyers and governments to use for long term storage or evidence collection. One of our first prospective clients was the UN wanting to use it to document Syrian war crimes. The point there was that we can save stuff without amplifying it, and that's sometimes useful in niche scenarios.

      Lawyers/LE especially don't want to broadcast to the world (or tip off their suspect) that they are investigating or endorsing a particular person, so the ability to capture without publicly announcing/mirroring every capture is vital.

      • dark-star 18 hours ago

        I guess he's talking about K_wi F_rms which was mentioned in one of the screenshots...

        • nikisweeting 17 hours ago

          Ahh that makes sense. Well all I can say to that is that it's not up to me what's evil. The point I was trying to make is: sometimes you want to archive something that you don't endorse / don't want to be publicly linked.

          You might not want to amplify and broadcast the fact that you're archiving it to the world.

  • the_gorilla a day ago

    I don't know how anyone manages to use archivebox. I've tried it twice in the last 3 years and its site compatibility is bad, it quietly leaks everything you archive to archive.org by default, and whenever it fails on a download it stops archiving anything even after deleting and resubmitting all the jobs.

    I'm sure it works for some people, but not me.

    • nikisweeting a day ago

      These are legitimate gripes that have plagued specific past releases, I hear your frustration. Please keep in mind this was a solo effort of a single developer, only worked on in my spare time over the last 7 years (up until very recently).

      The new v0.8 adds a BG queue specifically to deal with the issue of stalling when some sites fail. There was a system to do this in the past, but it was imperfect and mostly optimized for the docker setup where a scheduler is running `archivebox update` every few hours to retry failed URLs.

      Site compatibility is much improved with the new BETA, but it's a perpetual cat-and-mouse game to fix specific sites, which is why we think the new plugin system is the way forward. It's just not sustainable for a single company (really just me right now) to maintain hundreds of workarounds for each individual site. I'm also discussing with the Webrecorder and Archive.org teams how we can share these site-specific workarounds as cross-compatible plugins (aka "behaviors") between our various software.

      > it quietly leaks everything you archive to archive.org by default

      It's prominently mentioned many times (at least 4) on our homepage that this is the default, and archiving public-only sites (which are already fair game for Archive.org) is a default for good reason. Archiving private content requires several important changes and security considerations. More context: https://news.ycombinator.com/item?id=26866689

      • freedomben a day ago

        Yeah, I'm not sure whether archive.org should be defaulted to on or off (I see both sides of that one), but its existence is definitely surfaced.

        I love Archive Box btw, thank you for your effort! It's filling a very important need.

      • the_gorilla a day ago

        I can accept the other issues, but ArchiveBox needs to be private and secure by default.

        Sending everything to archive.org is a bad default value and it erodes a certain level of trust in the project. Requiring "several important changes and security considerations" just makes it a non-starter. The default settings should be "safe" for the default user, because as you mentioned in that post, 90% of users are never going to change them. Users should be able to run it locally and archive data without worrying about security issues, unless you only want experts to be able to use your software.

        There's also a contradiction between your statement and your blog post: someone saving their photos isn't going to want to worry about whether they configured your tool correctly, or about leaking all the group logs or grandma's photos.

        >It's prominently mentioned many times (at least 4) on our homepage that this is the default, and archiving public-only sites (which are already fair game for Archive.org) is a default for good reason. Archiving private content requires several important changes and security considerations. More context

        > Who cares about saving stuff?

        > All of us have content that we care about, that we want to see preserved, but privately:

        > families might want to preserve their photo albums off Facebook, Flickr, Instagram

        > individuals might want to save their bookmarks, social feeds, or chats from Signal/Discord

        > companies might want to save their internal documents, old sites, competitor analyses, etc.

        I want the project to do well but it really needs to be secure by default.

        • nikisweeting a day ago

          > The default settings should be "safe" for the default user,

          I 100% agree, but because private archiving is doable but NOT 100% safe yet, I can't make that mode the default. The difficult reality currently is that archiving anything non-public is not simple to make safe.

          Every capture will contain reflected session cookies, usernames, and PII, and other sensitive content. People don't understand that this means if they share a snapshot of one page they're potentially leaking their login credentials for an entire site.

          It is possible to do safely, and we provide ways to achieve that which I'm constantly working on improving, but until it's easy and straightforward and doesn't require any user education on security implications, I can't make it the default.

          The goal is to get it to the point where it CAN be the default, but I'm still at least 6mo away from that point. Check out the archivebox/sessions dir in the source code for a look at the development happening here.

          Until then, it requires some user education and setting up a dedicated chrome profile + cookies + tweaking config to do. (as an intentional barrier to entry for private archiving)

          • bigiain 18 hours ago

            That's a really good response, thanks.

            I've been very impressed by all of your responses in here, but that one in particular shows empathy, compassion, and a deep deep subject matter expertise.

            • nikisweeting 17 hours ago

              Thank you. And thank you for taking the time to read all of it, there's a lot of great questions being asked.

        • Apocryphon a day ago

          Perhaps this data is "private" as in "personal property" and not "private" as in "confidential."

          • nikisweeting a day ago

            It's intended for both but it currently requires extra setup to do "confidential" because there are security risks.

        • hobs a day ago

          As a custom tool built to archive stuff for archive.org, why would you expect that it can also do a completely opposite task, saving information privately?

          I can see why you would want such a tool, but it seems like a direct divergence from the core goal of the existing codebase.