Inside The Internet Archive's Infrastructure

(hackernoon.com)

131 points | by dvrp 2 days ago

21 comments

hedora 2 hours ago

It's frustrating that there's no way for people to (selectively) mirror the Internet Archive. $25-30M per year is a lot for a non-profit, but it's nothing for government agencies, or private corporations building Gen AI models.

I suspect having a few different teams competing (for funding) to provide mirrors would rapidly reduce the hardware cost too.

The density + power dissipation numbers quoted are extremely poor compared to enterprise storage. Hardware costs for the enterprise systems are also well below AWS (even assuming a short 5-year depreciation cycle on the enterprise boxes). Neither this article nor the vendors publish enough pricing information to do a thorough total cost of ownership analysis, but I can imagine someone the size of IA would not be paying normal margins to their vendors.

toomuchtodo 2 hours ago

Pick the items you want to mirror and seed them via their torrent file.

https://help.archive.org/help/archive-bittorrents/

https://github.com/jjjake/internetarchive

https://archive.org/services/docs/api/internetarchive/cli.ht...

u/stavros wrote a design doc for a system (codename "Elephant") that would scale this up: https://news.ycombinator.com/item?id=45559219

(no affiliation, I am just a rando; if you are a library, museum, or similar institution, ask IA to drop some racks at your colo for replication, and as always, don't forget to donate to IA when able to and be kind to their infrastructure)
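
For anyone scripting the torrent route, here is a minimal sketch that builds per-item torrent URLs. It assumes the `https://archive.org/download/<identifier>/<identifier>_archive.torrent` naming pattern holds for the items you pick (check it against the help page above before relying on it). `torrent_url`, `torrent_urls`, and the identifiers passed in are hypothetical names for illustration, not part of any IA tool.

```python
from urllib.parse import quote

# Assumption: item torrents live under this base, named <identifier>_archive.torrent.
ARCHIVE_DOWNLOAD_BASE = "https://archive.org/download"


def torrent_url(identifier: str) -> str:
    """Build the assumed torrent URL for one item identifier."""
    identifier = identifier.strip()
    if not identifier:
        raise ValueError("item identifier must not be empty")
    # Percent-encode so an unusual identifier cannot break the URL path.
    safe = quote(identifier, safe="")
    return f"{ARCHIVE_DOWNLOAD_BASE}/{safe}/{safe}_archive.torrent"


def torrent_urls(identifiers):
    """Map identifiers to torrent URLs, dropping blanks and duplicates, keeping order."""
    seen = set()
    urls = []
    for ident in identifiers:
        ident = ident.strip()
        if ident and ident not in seen:
            seen.add(ident)
            urls.append(torrent_url(ident))
    return urls
```

The resulting list can be handed to whatever torrent client you already run; the sketch makes no network requests itself.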

qingcharles 12 minutes ago

The fact that AI companies are strip-mining IA for content and not helping to be part of the solution is egregious.

nodja an hour ago

It's insane to me that in 2008 a bunch of pervs decentralized storage and made hentai@home to host hentai comics. Yet here we are almost 20 years later and we haven't generalized this solution. Yes, I'm aware of the privacy issues h@h has (as a hoster you're exposing your real IP and people reading comics are exposing their IP to you), but those can be solved with tunnels; the real value is the redundant storage.

philipkglass 2 hours ago

I would like to be able to pull content out of the Wayback Machine with a proper API [1]. I'd even be willing to pay a combination of per-request and per-gigabyte fees to do it. But then I think about the Archive's special status as a non-profit library, and I'm not sure that offering paid API access (even just to cover costs) is compatible with the organization as it exists.

[1] It looks like this might exist at some level, e.g. https://github.com/hartator/wayback-machine-downloader, but I've been trying to use this for a couple of weeks and every day I try I get an HTTP 5xx error or "connection refused."
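
A minimal sketch of the request URLs involved, under stated assumptions: that the availability lookup lives at `https://archive.org/wayback/available` and takes `url` and optional `timestamp` query parameters, and that a capture is addressed as `https://web.archive.org/web/<timestamp>/<url>`. `availability_query` and `snapshot_url` are hypothetical helper names, and nothing here makes a network call or addresses the 5xx errors described above.

```python
from urllib.parse import urlencode

# Assumption: the public availability lookup is served from this endpoint.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(url: str, timestamp: str = "") -> str:
    """Build a lookup URL asking for the capture closest to `timestamp`."""
    params = {"url": url}
    if timestamp:
        # Timestamps are assumed to be digit strings such as YYYYMMDD or YYYYMMDDhhmmss.
        if not timestamp.isdigit() or len(timestamp) > 14:
            raise ValueError("timestamp must be 1-14 digits, e.g. 20200101")
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def snapshot_url(url: str, timestamp: str) -> str:
    """Build the assumed address of a capture of `url` at `timestamp`."""
    return f"https://web.archive.org/web/{timestamp}/{url}"
```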

toomuchtodo an hour ago

https://github.com/internetarchive/wayback/tree/master/wayba...

https://akamhy.github.io/waybackpy/

https://wiki.archiveteam.org/index.php/Restoring

philipkglass an hour ago

Yes, there are documents and third-party projects indicating that it has a free public API, but I haven't been able to get it to work. I presume that a paid API would have better availability and the possibility of support.

I just tried waybackpy and I'm getting errors with it too when I try to reproduce their basic demo operation:

  >>> from waybackpy import WaybackMachineSaveAPI
  >>> url = "https://nuclearweaponarchive.org"
  >>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
  >>> save_api = WaybackMachineSaveAPI(url, user_agent)
  >>> save_api.save()
  Traceback (most recent call last):
    File "<python-input-4>", line 1, in <module>
      save_api.save()
      ~~~~~~~~~~~~~^^
    File "/Users/xxx/nuclearweapons-archive/venv/lib/python3.13/site-packages/waybackpy/save_api.py", line 210, in save
      self.get_save_request_headers()
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
    File "/Users/xxx/nuclearweapons-archive/venv/lib/python3.13/site-packages/waybackpy/save_api.py", line 99, in get_save_request_headers
      raise TooManyRequestsError(
      ...<4 lines>...
      )
  waybackpy.exceptions.TooManyRequestsError: Can not save 'https://nuclearweaponarchive.org'. Save request refused by the server. Save Page Now limits saving 15 URLs per minutes. Try waiting for 5 minutes and then try again.

toomuchtodo an hour ago

Reach out to patron services, support @ archive dot org. Also, your API limits will be higher if you authenticate your requests with the API key from your IA account than if you make them anonymously.
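
Until the limits are sorted out, a client can at least wait and retry instead of failing on the first refusal. The following is a minimal sketch of exponential backoff with jitter, not anything IA or waybackpy ships: `call_with_backoff`, its parameters, and the default delays are all hypothetical choices for illustration.

```python
import random
import time


def call_with_backoff(func, is_rate_limited, max_attempts=5,
                      base_delay=1.0, max_delay=300.0, sleep=time.sleep):
    """Call func(); on a rate-limit error, wait and retry with growing delays."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as exc:
            # Re-raise anything that is not a rate limit, and the final failure.
            if not is_rate_limited(exc) or attempt == max_attempts - 1:
                raise
            # Double the wait on each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Add up to 10% jitter so clients that failed together do not retry together.
            sleep(delay + random.uniform(0, delay * 0.1))
```

With the session above, `func` could be `save_api.save` and `is_rate_limited` a check for `waybackpy.exceptions.TooManyRequestsError`, the exception shown in the traceback. Since that error message suggests waiting five minutes, a `base_delay` in that range is probably more realistic than the one-second default.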

BryantD 3 hours ago

They have come a very long way since the late 1990s when I was working there as a sysadmin and the data center was a couple of racks plus a tape robot in a back room of the Presidio office with an alarmingly slanted floor. The tape robot vendor had to come out and recalibrate the tape drives more often than I might have wanted.

textfiles 2 hours ago

There is a fundamental resistance to tape technology that exists to this day as a result of all those troubles.

mcpar-land 31 minutes ago

Is this some kind of copypasted AI output? There are unformatted footnote numbers at the end of many sentences.

NetOpWibby 23 minutes ago

I was thinking the same thing. No proofreading is a sure sign to me. I also feel like I've read parts of this before.

lysace 20 minutes ago

The IA needs perhaps not just more money, but also more talented people, IMO. I worry that it has stagnated, from a tech pov.

cowhax 2 hours ago

>And the rising popularity of generative AI adds yet another unpredictable dimension to the future survival of the public domain archive.

I'd say the nonprofit has found itself a profitable reason for its existence.

brcmthrowaway 3 hours ago

Does IA do deduplication?

textfiles 2 hours ago

Not in the way I think you're talking about. The archive has always tried to maintain a situation where the racks could be pushed out of the door or picked up after being somewhere and the individual drives will contain complete versions of the items. We have definitely reached out to people who seemed to be doing redundant work and asked them to stop or for permission to remove the redundant item. But that's a pretty curatorial process.

HumanOstrich 2 hours ago

Yes[1].

[1]: The Article, Paragraph 2

zxcvasd 41 minutes ago

here's the second paragraph in full:

"Here, amidst the repurposed neoclassical columns and wooden pews of a building constructed to worship a different kind of permanence, lies the physical manifestation of the "virtual" world. We tend to think of the internet as an ethereal cloud, a place without geography or mass. But in this building, the internet has weight. It has heat. It requires electricity, maintenance, and a constant battle against the second law of thermodynamics. As of late 2025, this machine—collectively known as the Wayback Machine—has archived over one trillion web pages.1 It holds 99 petabytes of unique data, a number that expands to over 212 petabytes when accounting for backups and redundancy.3"

can you help my small brain by pointing out where in this paragraph they talk about deduplication?

sltkr an hour ago

I don't think the article mentions anything about deduplication. Can you be less snarky and actually quote the relevant sentence?

schmuckonwheels an hour ago

Disappointed with the lack of pictures.

parttimelarry an hour ago

Probably because this looks more like a Deep Research agent "delving" into the infrastructure -- with a giant list of sources at the end. The Archive is not just a library; it is a service provider.