Whenever I read about such issues I always wonder why we all don’t make more use of BitTorrent. Why is it not the underlying protocol for much more stuff? Like container registries? Package repos, etc.
1. BitTorrent has a bad rep. Most people still associate it with illegal downloads.
2. It requires slightly more complex firewall rules, and asking the network admin to put them in place might raise some eyebrows because of reason 1. On very restrictive networks, they might not want to allow them at all, since that opens the door for, well, BitTorrent.
3. A BitTorrent client is more complicated than an HTTP client, and isn't installed on most company computers / CI pipelines (for lack of need, and again reason 1). A lot of people just want to `curl` and be done with it.
4. A lot of people think they are required to seed, and for some reason that scares the hell out of them.
Overall, I think it is mostly 1 and the fact that you can just simply `curl` stuff and have everything working.
It does sadden me that people don't appreciate how good a file transfer protocol BT is and how underused it is. I remember some video game clients using BT for updates under the hood, and PeerTube uses WebTorrent, but BT is sadly not very popular.
At a previous job, I was downloading daily legal torrent data when IT flagged me. The IT admin, eager to catch me doing something wrong, burst into the room shouting with management in tow. I had to calmly explain the situation, as management assumed all torrenting was illegal and there had been previous legal issues with an intern pirating movies. Fortunately, other colleagues backed me up.
Hey, ages ago, as an intern, I was flagged for BitTorrent downloads. As it turned out, I was downloading/sharing Ubuntu ISOs, so things didn't escalate too far, but it was a scary moment.
I left a Linux ISO (possibly Ubuntu) seeding on a lab computer at university, and forgot about it after I'd burned the DVD. You can see this was a while ago.
A month later an IT admin came to ask what I might be doing with port 6881. Once I remembered, we went to the tracker's website and saw "imperial.ac.uk" had the top position for seeding, by far.
Fortunately, it was the nice way — that university is one of the backbone nodes of the UK academic network, so the bandwidth use was pretty much irrelevant.
Are you just theorizing, or is there precedent of this? I mean of lawyers sending cease and desist letters to people torrenting random encrypted streams of data?
Do companies send out C&Ds for your average user torrenting? I've gotten thousands of DMCA letters but never a C&D, and I've only ever heard of 1 person getting one, and they were silly enough to be hosting a collection of paid content that they scraped themselves, from their home.
DMCA demands are, as far as I'm aware, completely automated and couldn't really cost much.
You know what has a bad rep? Big companies that use and trade my personal information like they own it. I'll start caring about copyrights when governments force these big companies to care about my information.
To play devil's advocate, I think the author of the message was talking about the corporate context where it's not possible to install a torrent client; Microsoft Defender will even remove it as a "potentially unwanted program", precisely because it is mostly used to download illegal content.
Obviously illegal ≠ immoral, and being a free-software/libre advocate opposed to copyright, I am in favor of the free sharing of humanity's knowledge, and therefore supportive of piracy, but that doesn't change the perception in a corporate environment.
That will depend on how Defender is configured - in a corporate environment it may be set to be far more strict. In fact tools other than Defender are likely to be used, but these often get conflated with Defender in general discussions.
There are various common malware payloads that include data transfer tools (http proxies, bittorrent clients, etc.) - it isn't just password scanners, keyboard monitors, and crypto miners. These tools can be used for the transfer of further malware payloads, to create a mesh network so more directed hacking attempts are much more difficult to track, to host illegal or immoral content, or for the speedy exfiltration of data after a successful directed hack (perhaps a spear-phish).
Your use of the stuff might not be at all malware-like, but in a corporate environment, if it isn't needed it gets flagged as something to check up on, in case it isn't there for a good reason. I've been flagged for some of the tools I've played with, and this is fine: I have legitimate use for that sort of thing in my dealings with infrastructure, so there are flags ticked that say “Dave has good reason to have these tools installed, don't bother us about it again unless he fails to install security updates that are released for them”. And this is also fine: I want those things flagged in case people who won't be doing the things I do end up with such stuff installed without their knowledge, so it can be dealt with (and they can be given more compulsory “don't just thoughtlessly click on every link in any email you receive, and carelessly type your credentials into resulting forms” training!).
Not in any country that is part of the big international IP agreements (Berne convention, Paris Act).
The only exception (sort of) is Switzerland. And the reason downloading copyrighted content you haven't bought for personal use is legal in Switzerland is because the government is essentially paying for it - there is a tax in Switzerland on empty media, the proceeds from which are distributed to copyright holders whose content is consumed in Switzerland, regardless of whether it is bought directly from the rights holder or otherwise.
Apparently the legal status of downloading copyrighted materials for personal use is also murky in Spain, where at least one judge found that it is legal - but I don't know how solid the reasoning was or whether other judges would agree (Spain being a civil law country, legal precedent is not binding there to the same extent that it would be in the UK or USA).
> Not in any country that is part of the big international IP agreements (Berne convention, Paris Act).
Poland signed the Berne Convention in 1919 and has "well regulated" copyright, but downloading any media (except software) for personal use is still fully legal. A tax on "empty media" is in place as well.
Format shifting and personal copying are legal in Poland, but you as an individual still have to have legally obtained your original in the first place to exercise that right, and an illicit download certainly doesn't count. The tax on "empty media" is there to compensate for those format-shifting rights, but it doesn't cover remuneration for acquiring media in the first place (and indeed no EU member state could operate such a scheme - they are prohibited by EU Directive 2001/29 https://eur-lex.europa.eu/legal-content/EN/ALL/?uri=celex%3A...).
> Format shifting and personal copying are legal in Poland, but you as an individual still have to have legally obtained your original in the first place to exercise that right, and an illicit download certainly doesn't count.
Like everywhere else where personal copies are legal and you can download them: if both conditions are true, then the mere fact that you are downloading something is not, by itself, a sign that you are downloading pirated content.
OTOH there is also Spain where piracy with no direct monetary gain is tolerated and nobody goes after people torrenting.
Yes it is. As long as the content was intentionally distributed by the rights holder (for example a movie had its premiere) you can legally download it from anywhere and use it for your own (and your friends) enjoyment however you please. You can't make it (or even the content you bought) available to people who aren't your actual friends (random people on the internet).
That's the Polish law, both the letter and the implementation. On at least one occasion the police issued an official statement saying exactly that.
I think no one was ever fined in Poland for incidental uploading while using the BitTorrent protocol to download. There are high-profile cases against people who were publishing large amounts of media files, especially commercially. A little more than a decade ago there was one case where some company tried to go after BitTorrent downloaders of 3 specific Polish movies, but I think it was ultimately thrown out or cheaply settled, because no case like that has been publicized ever since and everybody who knows how to use BitTorrent does.
Again, it covers everything except software, which has more restrictive rules closer to what you think the law is.
The tax on empty media was set up a long time ago to support creators whose music is shared directly among friends. It was not intended to compensate for downloads. I think only Polish artists receive any money from it (I might be wrong on that), and the organization that distributes the money is highly inefficient. They tried to extend the tax to electronic devices, but nobody likes them (companies and people alike), so they haven't gotten far with that proposal for now.
Poland enjoys a lot of digital freedoms and is conscious of them, ready to defend them against ACTA and Chat Control, and to extend them with Stop Killing Games.
There is also the US. It is legal to download movies in the United States. You can, however, get dinged by the automated lawsuit or complaint bots for uploading them, which makes torrenting without a vpn less than ideal.
Finding examples of people getting successfully sued for downloading or viewing movies without sharing them should be trivial, then.
Otherwise, less any examples of enforcement or successful legal action, downloading movies is illegal in the US in the same way that blasphemy is illegal in Michigan.
It's absolutely illegal in the USA to download a movie (or game, or song, or book, etc) that you haven't bought [0].
It could be argued that if you bought a movie, say on DVD, downloading another copy of it from an online source could fall under fair use, but this is more debatable.
I am not aware of a single example of someone getting successfully sued for downloading a movie. Every lawsuit that I’m aware of (even going back to the Napster days) people got sued for sharing content using p2p software. The current lawsuit robots download a torrent and then wait for individual IPs to upload some chunk of a copyrighted file to them, which they then use as proof of somebody sharing copyrighted material for their complaint.
Even the Protecting Lawful Streaming Act of 2020 explicitly does not punish consumers of copyrighted content, only its distributors.
>Tillis stated that the bill is tailored to specifically target the websites themselves, and not "those who may use the sites nor those individuals who access pirated streams or unwittingly stream unauthorized copies of copyrighted works"
There are so many paragraphs in response to my “You can’t get in trouble for downloading movies in the US” post and none of them have any examples of people getting in trouble for downloading movies in the US.
None. Because you projected your country's laws in the discussion, you failed to see that the countries that allow copyrighted material to be downloaded for personal usage do not qualify that download as "copyright infringement" in the first place.
To answer your question with the only answer I know: Switzerland.
A download is a copy of a work. So, downloading a movie is making a copy of a work that you are not a copyright holder of - in other words, either you or the site you are downloading from are infringing on the copyright holder's exclusive right to create copies of their work. You could claim there is some fair use exemption for this case, or you can have an alternative way of authorizing copies and paying for them like Switzerland does, but there is no doubt in any legal system that downloading is the same kind of action as copying a book at a print shop.
I love how enthusiastic this post is while being wrong.
Making a copy of a thing does not violate copyright (eg you can photocopy a book that you possess even temporarily). Sharing a copy that you made can violate copyright.
It is like mixing up “it’s illegal to poison somebody with bleach” and “it’s illegal to own bleach”. The action you take makes a big difference
Also, as an aside, when you view a legitimately-purchased and downloaded video file that you have license to watch, the video player you use makes a copy from the disk to memory.
If I own a license to listen to Metallica - Enter Sandman.m4a that I bought on iTunes and in the download folder I screw up and I make
Metallica - Enter Sandman(1).m4a
Metallica - Enter Sandman(2).m4a
Metallica - Enter Sandman(3).m4a
How much money do I owe Lars Ulrich for doing that based on The Law of The Earth Everywhere But Switzerland?
You're mixing up several things, all of which actually boil down to the fair use exceptions I was mentioning.
Making copies of a book you legally own for personal use is an established fair use exception to copyright. However, making copies of a book that you borrowed from a library would be copyright infringement. Similarly, lending the copies you've made of a book to friends would technically void the fair use exception for your copies.
The copy that a playback device has to make of a copyrighted audio/video file for its basic functioning is typically mentioned explicitly in the license you buy, thus being an authorized copy for a specific purpose. If you make several copies of a file on your own system for personal use, then again you are likely within fair use exemptions, similar to the book-copying case - though this is often complicated a bit legally by the fact that you don't own a copy but a license to use the work in various ways, and some companies' licenses can theoretically prohibit even archival copies, which in turn may or may not be legal in various jurisdictions.
But in no jurisdiction is it legal to, for example, go with a portable photocopy machine into a bookstore and make copies of books you find in there, even if they are only for personal use: you first have to legally acquire an authorized copy from the rights holder. All other exemptions apply to what you do with that legally obtained copy.
This even means that you don't have any rights to use a fraudulent copy of a work, even if you legitimately believed you were obtaining a legal copy. For example, say a library legally bought a book from a shady bookstore that, unbeknownst to them, was selling counterfeit copies of a book. If the copyright holder finds out, they can legally force the library to pay them to continue offering this book, or to destroy it otherwise, along with any archival copies that they had made of this book. The library can of course seek to obtain reparations from the store that sold them the illegal copy, but they can't refuse to pay the legal copyright holder.
> I love how enthusiastic this post is while being wrong.
This is a very funny thing to say given that post is entirely correct, while you are wrong.
> Making a copy of a thing does not violate copyright
Yes it does, unless it's permitted under a designated copyright exemption by local law. For instance, you mention that the video player makes a copy from disk to memory, well that is explicitly permitted by Article 5(1) of the Copyright Directive 2001 in the EU as a use that is "temporary, transient or incidental and an integral and essential part of a technological process", as otherwise it would be illegal as by default, any action to copy is a breach of copyright. That's literally where the word comes from.
> If I own a license to listen to Metallica - Enter Sandman.m4a that I bought on iTunes and in the download folder I screw up and I make
> Metallica - Enter Sandman(1).m4a
> Metallica - Enter Sandman(2).m4a
> Metallica - Enter Sandman(3).m4a
In legal terms you do indeed owe him something, yes. It would probably be covered under the private copy exemptions in some EU territories, but only on the basis that blank media is taxed to pay rightsholders a royalty for these actions under the relevant collective management associations.
I assume this is Germany. Usually you can haggle it down to the low hundreds if it's the first time and you show you're just a regular young person with not much income.
First off, it was like 2 months after my father's death; we didn't have time for this. Secondly, my mom got an attorney, whom I paid. It was roughly the same amount though. We never paid them.
> It requires slightly more complex firewall rules, and asking the network admin to put them in place might raise some eyebrow for reason 1
Well, in many such situations the data is provided for free, putting a huge burden on the provider. Even if it's a little bit less convenient, it makes the service a lot more sustainable. I imagine torrents for the free tier and direct download as a premium option would work perfectly.
I wish... I overpay more than double market value for my connection, and am not allowed to configure my router. This is the norm for most apartment dwellers in the US as far as I'm aware.
5. Most residential internet connections are woefully underprovisioned for upload so anything that uses it more (and yes you need people to seed for bittorrent to make sense) can slow down the entire connection.
6. Service providers have little control over the service level of seeders and thus the user experience. And that's before you get malicious users.
seeding is uploading after you are done downloading.
but you are already uploading while you are still downloading. and that can't be turned off. if seeding scares someone, then uploading should scare them too. so they are right, because they are required to upload.
If you have been in the scene long enough, you'd know that there are some uncooperative clients that always report 0% (Xunlei being one of the more notorious examples with their VIP schemes; later on they would straight up spoof their client string when people started blocking them). Being a leecher is almost the norm for a lot of users nowadays, and I don't blame them, since they are afraid of consequences in more regulated jurisdictions. But a must-seed-when-leeching requirement? Hoho no, that's more like a suggestion.
> but you are already uploading while you are still downloading. and that can't be turned off
Almost every client lets you set an upload limit, which you can set to 0. The only thing that generates upload bandwidth usage that cannot be deactivated would be protocol overhead (though you can disable parts of BT, like the DHT).
I think it should be up to the client to decide whether they want to seed. As another commenter mentioned, it could be for legal reasons. Perhaps downloading in that jurisdiction is legal but uploading is not. Perhaps their upload traffic is more expensive.
Now, as a seeder, you may still be interested in those clients being able to download and reach whatever information you are seeding.
In the same vein, as a seeder, you may just not serve those clients. That's kind of the beauty of it. I understand that there may be some old school/cultural "code of conduct" but really this is not a problem with a behavioral but instead with a technical solution that happens to be already built-in.
> I think it should be up to the client to decide whether they want to seed
well, yes and no. legal issues aside (think about using bittorrent only for legal stuff), the whole point of bittorrent is that it works best if everyone uploads.
actually, allowing clients to disable uploading is almost an acknowledgement that illegal uses should be supported, because there are few reasons why legal uses should need to disable uploading.
and as an uploader i also don't want others not to upload. so while disabling upload is technically possible, it is also reasonable and not unlikely that connections from such clients could be rejected.
In the US, data caps are one reason to be stingy about seeding even if legality is not an issue. In that case though the user could still do something like limit the upload bandwidth while still seeding long-term, so their contribution is to provide availability and prevent a situation where no full seeds exist.
Some BitTorrent clients make it easier to set than others, but if it's a healthy torrent I often limit the upload rate to something so slow that it doesn't transfer anything up. Ratio is 0.00 and I still get 100s of Mb/s.
Webtorrent exists. It uses webrtc to let users connect to each other. There's support in popular trackers.
This basically handles every problem stated. There's nothing to install on computers: it's just JS running on the page. There are no firewall rules or port forwarding to set up; it's all handled by STUN/TURN in WebRTC. Users wouldn't necessarily even be aware they are uploading.
STUN is not always possible and TURN means proxying the connection through a server which would be counter-productive for the purpose of using bit-torrent as an alternative to direct HTTP downloads as you are now paying for the bandwidth in both directions. This is very much not a problem with magic solutions.
Agreed! But STUN's success rate is pretty good! As the number of peers goes up it should be less likely that one would need to use TURN to connect to a peer, but I am skeptical webrtc is built to fall back like this, to try other peers first.
The advantage is that at least it's all builtin. It's not a magic solution, but it's a pretty good solution, with fallbacks builtin for when the networking gets in the way of the magic.
Amazon, Esri, Grab, Hyundai, Meta, Microsoft, Precisely, Tripadvisor and TomTom, along with tens of other businesses, got together to offer OSM data in Parquet on S3 free of charge. You can query it surgically and run analytics on it needing only MBs of bandwidth against what is a multi-TB dataset at this point. https://tech.marksblogg.com/overture-dec-2024-update.html
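For a flavour of what "querying it surgically" looks like, here is a rough sketch with DuckDB - the bucket path, release date and column names are placeholders from memory, so check the current Overture docs before copying any of it:

```python
# Sketch: DuckDB range-reads the Parquet footers/row groups over S3, so a
# bbox-filtered query only moves a few MB of a multi-TB dataset.
# Path, release date and columns below are assumptions, not gospel.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
con.execute("SET s3_region = 'us-west-2'")  # public bucket; older DuckDB may need more s3_* settings

con.sql("""
    SELECT id, bbox, geometry
    FROM read_parquet(
        's3://overturemaps-us-west-2/release/2024-12-18.0/theme=buildings/type=building/*',
        hive_partitioning = 1)
    WHERE bbox.xmin > 9.10 AND bbox.xmax < 9.25      -- roughly central Milan
      AND bbox.ymin > 45.43 AND bbox.ymax < 45.50
    LIMIT 10
""").show()
```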
As someone who works with mapping data for HGV routing, I've been keeping an eye on Overture. I wonder do you know if anyone has measured the data coverage and quality between this and proprietary datasets like HEREmaps? Does Overture supplement OSM road attributes (such as max height restrictions) where they can find better data from other sources?
I haven't done any deep dives into their road data but there was ~80 GB of it, mostly from TomTom, in the August release. I think the big question would be how much overlap there is with HERE and how would the metadata compare.
If you have QGIS running, I did a walkthrough using the GeoParquet Downloader Plugin with the 2.75B Building dataset TUM released a few weeks ago. It can take any bounding box you have your workspace centred on and download the latest transport layers for Overture. No need for a custom URL as it's one of the default data sources the plugin ships with. https://tech.marksblogg.com/building-footprints-gba.html
Thanks for the response. There must be value in benchmarking data coverage and quality for routing data such as speed limits, vehicle restrictions, hazardous cargo, etc. I guess the problem is what to benchmark against.
I remember seeing the concept of "torrents with dynamic content" a few years ago, but apparently it never became a thing[1]. I kind of wish it had, but I don't know if there are critical problems (i.e. security?).
I assume it’s simply the lack of the built-in “universal client” that HTTP enjoys, or that devs tend to have with ssh/scp. Not that such a client (even an automated/scripted CLI client) would be so difficult to set up, but then trackers are also necessary, and then the tooling for maintaining it all. Intuitively, none of this sounds impossible, or even necessarily that difficult apart from a few tricky spots.
I think it’s more a matter of how large the demand is for frequent downloads of very large files/sets, which leads to questions of reliability and seeding volume, all versus the effort involved to develop the tooling and integrate it with various RCS and file-syncing services.
Would something like Git LFS help here? I’m at the limit of my understanding for this.
Are you serious? Most Debian ISO mirrors I've used have 10gig connectivity and usually push a gigabit or two fairly easily. BitTorrent is generally a lot slower than that (it's a pretty terrible protocol for connecting you to actually fast peers and getting stuff quickly from them).
I've definitely seen higher speeds with BitTorrent, pretty easily maxing out my gbe nics, but I'm not downloading Debian images specifically with much frequency.
I used to work at a company that had to deliver huge files to every developer every week. At some point they switch from a thundering herd of rsyncs to using BitTorrent. The speed gains were massive.
It became disliked because of various problems and complaints, but mainly disappeared because Blizzard got the bright idea of preloading the patchset, especially for new expansions, in the weeks before. You can send down a ten-gig patch a month before release, then patch that patch a week before release, and a final small patch on the day before release, and everything is preloaded.
The great Carboniferous explosion of CDNs inspired by Netflix and friends has also greatly simplified the market, too.
https://github.com/uber/kraken exists, using a modified BT protocol, but unless you are distributing quite large images to a very large number of nodes, a centralized registry is probably faster, simpler and cheaper
To use bittorrent, your machine has to listen, and otherwise be somehow reachable. In many cases, it's not a given, and sometimes not even desirable. It sticks out.
I think a variant of bittorrent which may be successful in corporate and generally non-geek environments should have the following qualities:
- Work via WebSockets.
- Run in browser, no installation.
- Have no rebel reputation.
It's so obvious that it must have been implemented, likely multiple times. It would not be well-known because the very purpose of such an implementation would be to not draw attention.
WebTorrent is ubiquitous by now and also supported by Brave and many torrent clients. There is still much room to build, though. Get in, the water is warm!
In my experience the official software was very buggy and unreliable. Which isn't great for something about making immutable data live forever. I had bugs with silent data truncation, GC deleting live paths and the server itself just locking up and not providing anything it had to the network.
The developers always seemed focused on making new versions of the protocols with very minor changes (no more protocol buffers, move everything to CBOR) rather than actually adding new features like encryption support or making it more suitable for hosting static sites (which seems to have been one of its main niches).
It also would have been a great tool for package repositories and other open source software archives. Large distros tend to have extensive mirror lists, but you need to configure them and find out which ones have good performance for you, and you can still only download from one mirror at a time. Decentralizing that would be very cool. Even if the average system doesn't seed any of the content, the fact that anyone can just mirror the repo and downloads automatically start pulling from them was very nice. It also makes the download resilient to any official mirror going down or changing URL. The fact that there is strong content verification built in is also great. Typically software mirrors need to use additional levels of verification (like PGP signatures) to avoid trusting the mirror.
I really like the idea, and the protocol is pretty good overall. But the implementation and evolution really didn't work well in my opinion. I tried using it for a long time, offering many of my sites over it and mirroring various data. But eventually I gave up.
And, maybe controversially, it provided no capabilities for network separation and statistics tracking. This isn't critical for success, but one entry point to this market is private file-sharing sites. Having those options could have given it a foot in the door and gotten a lot more people interested in development.
Hopefully the next similar protocol will come at some point, maybe it will catch on where IPFS didn't.
I used IPFS several years ago to get some rather large files from a friend, who had recently been interested in IPFS. From what I recall it took a full week or so to start actually transferring the files. It was so slow and finicky to connect. Bittorrent is dramatically easier to use, faster, and more reliable. It was hard to take IPFS seriously after that. I also recall an IRC bot that was supposed to post links to memes at IPFS links and they were all dead, even though it's supposed to be more resilient. I don't have the backstory on that one to know how/why the links didn't work.
I wouldn't expect that to hold up any more than a silly idea I had (probably not original) a while back of "Pi-Storage".
The basic idea being: can you use the digits of Pi to encode data, or rather, can you find ranges of Pi that map to data you have and use them for "compression"?
A very simple example, let's take this portion of Pi:
314159265358979323846264338327950288419716939937510
Then let's say we have a piece of data that, when encoded and just numbers, results in: 15926535897997169626433832
Can we encode that as: 4–15, 39–43, 21–25, 26–29 and save space? The "compression" step would take a long time (at some point you have to stop searching for overlap as Pi goes on for forever).
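A throwaway sketch of the lookup, using mpmath for the digits (the example digit string is deliberately one that occurs early in Pi, so the search actually succeeds):

```python
# Toy "Pi-storage": map chunks of a digit string to (offset, length) pairs
# into the first N digits of pi. N and the chunk size are arbitrary.
from mpmath import mp

mp.dps = 100_000                                  # how many digits of pi to search
PI_DIGITS = mp.nstr(+mp.pi, mp.dps).replace("3.", "", 1)

def encode(data_digits: str, chunk: int = 6) -> list[tuple[int, int]]:
    refs = []
    for i in range(0, len(data_digits), chunk):
        piece = data_digits[i:i + chunk]
        offset = PI_DIGITS.find(piece)            # -1 if this chunk never shows up
        if offset == -1:
            raise ValueError(f"{piece!r} not found in the first {mp.dps} digits")
        refs.append((offset, len(piece)))
    return refs

def decode(refs: list[tuple[int, int]]) -> str:
    return "".join(PI_DIGITS[o:o + n] for o, n in refs)

refs = encode("15926535897932384626433832")      # illustrative digit string
assert decode(refs) == "15926535897932384626433832"
# Each (offset, length) pair usually takes more digits to write down than the
# chunk it replaces, which is why the "compression" loses on average.
```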
Anyways, a silly thought experiment that your idea reminded me of.
Is C really "pure noise" if you can get A back out of it?
It's like an encoding format or primitive encryption, where A is merely transformed into unrecognizable data, meaningful noise, which still retains the entirety of the information.
> Is C really "pure noise" if you can get A back out of it?
If you throw out B, then there's no possible way to get A out of C (short of blindly guessing what A is): that's one of the properties of a one-time pad.
But distributing both B and C is no different than distributing A in two parts, and I'd have a hard time imagining it would be treated any differently on a legal level.
Thanks for engaging in this discussion and explaining your thoughts. I'm still trying to understand.
> given ciphertext C = A ^ B you can decrypt plaintext A using key B
Right, C ^ B = A, to get back the original.
Already here, it seems to me that C is not pure random noise if A can be recovered from it. C contains the encoded information of A, so it's not random. Is that wrong?
---
> or you can decrypt plaintext D using key E = C ^ D
In this example, isn't E the cyphertext and C is the key? E (cyphertext) = D (original) ^ C (key).
Then E ^ C = D, to get back the original.
Here, it seems to me that E contains the encoded information of D, so it's not random. And C plays the role of a key similar to B, so it's not being decoded as a ciphertext in this case, and nothing is implied whether it's random or not.
> XOR is commutative so for a one-time pad key and ciphertext can be swapped.
Maybe this is what I'm missing. In the second example, I'll swap the ciphertext and key:
C ^ E = D
Hmm, so both C and E are required to recover D the original information. Does that mean that the information was somehow distributed into C and E, so that they are both meaningful data, not random?
---
But what about B in the first example, a key that was supposed to be pure random? If I swap the ciphertext and key:
B ^ C = A
The information in A is recovered from the interaction of both B and C. So the entirety of the information is not in C, the ciphertext (or key in this case).
Does that mean, B is no longer pure noise? When it was used as a key, it became meaningful in relation to A. That's information, isn't it?
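Writing it out with actual bytes (arbitrary example strings) to check the symmetry:

```python
# One-time pad with XOR: C = A ^ B. Any two of {A, B, C} recover the third,
# and given C alone, every same-length plaintext is equally plausible:
# pick any D, compute E = C ^ D, and E "decrypts" C to D.
import secrets

def xor(x: bytes, y: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(x, y))

A = b"attack at dawn"                # the "real" plaintext
B = secrets.token_bytes(len(A))      # truly random key
C = xor(A, B)                        # ciphertext, indistinguishable from noise on its own

assert xor(C, B) == A                # key B turns C back into A

D = b"retreat at ten"                # any other plaintext of the same length
E = xor(C, D)                        # a "key" that makes C decrypt to D instead
assert xor(C, E) == D
```

So it seems C only becomes meaningful relative to whichever key you pair it with; on its own it is consistent with every possible message of that length.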
What is the point of this? If you think you can mount an adequate defense based on xor in a court of law, then you are sorely mistaken. Any state attorney will say infringement with an additional step of obfuscation is still infringement, and any judge will follow that assessment.
From a network point of view, BitTorrent is horrendous. It has no way of knowing network topology which frequently means traffic flows from eyeball network to eyeball network for which there is no "cheap" path available (potentially causing congestion of transit ports affecting everyone) and no reliable way of forecasting where the traffic will come from making capacity planning a nightmare.
Additionally, as anyone who has tried to share an internet connection with someone heavily torrenting, the excessive number of connections means overall quality of non-torrent traffic on networks goes down.
Not to mention, of course, that BitTorrent has a significant stigma attached to it.
The answer would have been a squid cache box before, but https makes that very difficult as you would have to install mitm certs on all devices.
For container images, yes you have pull through registries etc, but not only are these non-trivial to setup (as a service and for each client) the cloud providers charge quite a lot for storage making it difficult to justify when not having a check "works just fine".
The Linux distros (and CPAN and TeX Live etc) have had mirror networks for years that partially address these problems, and there was an OpenCaching project running that could have helped, but it is not really sustainable for the wide variety of content that would be cached, outside of video media or packages that only appear on caches hours after publishing.
BitTorrent might seem seductive, but it just moves the problem, it doesn't solve it.
> From a network point of view, BitTorrent is horrendous. It has no way of knowing network topology which frequently means traffic flows from eyeball network to eyeball network for which there is no "cheap" path available...
As a consumer, I pay the same for my data transfer regardless of the location of the endpoint though, and ISPs arrange peering accordingly. If this topology is common then I expect ISPs to adjust their arrangements to cater for it, just the same as any other topology.
Two eyeball networks (consumer/business ISPs) are unlikely to have large PNIs with each other across wide geographical areas to cover sudden bursts of traffic between them. They will, however, have substantial capacity to content networks (not just CDNs, but AWS/Google etc) which is what they will have built out.
BitTorrent turns fairly predictable "North/South" traffic, where capacity can be planned in advance and handed off "hot potato" as quickly as possible, into what is essentially "East/West" traffic with no clear consistency. That would cause massive amounts of congestion and/or unused capacity, as they would have to carry traffic over long distances they are not used to, with no guarantee that a given large flow will still exist in a few weeks' time.
If BitTorrent knew network topology, it could act smarter -- CDNs accept BGP feeds from carriers and ISPs so that they can steer the traffic, this isn't practical for BitTorrent!
> If BitTorrent knew network topology, it could act smarter -- CDNs accept BGP feeds from carriers and ISPs so that they can steer the traffic, this isn't practical for BitTorrent!
AFAIK this has been suggested a number of times, but has been refused out of fear of creating “islands” that carry distinct sets of chunks. It is, of course, a non-issue if you have a large number of fast seeds around the world (and if the tracker would give you those reliably instead of just a random set of peers!), but that really isn't what BT is optimized for in practice.
Exactly. As it happens, this is an area I'm working on right now -- instead of using a star topology (direct), or a mesh (BitTorrent), or a tree (explicitly configured CDN), to use an optimistic DAG. We'll see if it gets any traction.
bittorrent will make best use of what bandwidth is available. better think of it as a dynamic cdn which can seamlessly incorporate static cdn-nodes (see webseed).
it could surely be made to care for topology but imho handing that problem to congestion control and routing mechanisms in lower levels works good enough and should not be a problem.
> bittorrent will make best use of what bandwidth is available.
At the expense of other traffic. Do this experiment: find something large-ish to download over HTTP, perhaps an ISO or similar from Debian or FreeBSD. See what the speed is like, and try looking at a few websites.
Now have a large torrent active at the same time, and see how slow the HTTP download drops to, and how much slower the web is. Perhaps try a Twitch stream or YouTube video, and see how the quality suffers greatly and/or starts rebuffering.
Your HTTP download uses a single TCP connection, and most websites will just use a single connection too (perhaps a few short-duration extra connections for JS libraries on different domains etc). By comparison, BitTorrent will have dozens if not hundreds of connections open, so instead of sharing the connection roughly in half, it is monopolising 95%+ of it.
The other main issue I forgot to mention is that on most cloud providers, downloading from the internet is free, uploading to the internet costs a lot... So not many on public cloud are going to want to start seeding torrents!
If your torrent client is having a negative effect on other traffic then use its bandwidth limiter.
You can also lower how many connections it makes, but I don't know anyone that's had need to change that. Could you show us which client defaults to connecting to hundreds of peers?
My example was to show locally what happens -- the ISP does not have control over how many connections you make. I'm saying that if you have X TCP connections for HTTP and 100X TCP connections for BitTorrent, the HTTP connections will be drowned out. Therefore, when the link at your ISP becomes congested, HTTP will be disproportionately affected.
For the second question, read the section on choking at https://deluge-torrent.org/userguide/bandwidthtweaking/ - Deluge appears to set the maximum number of connections per torrent to 120, with a global max of 250 (though I've seen 500+ in my brief searching, mostly for Transmission and other clients).
I'll admit a lot of my BitTorrent knowledge is dated (having last used it ~15 years ago) but the point remains: ISPs are built for "North-South" traffic, that is: To/From the customer and the networks with the content, not between customers, and certainly not between customers of differing ISPs.
Interesting... It's been ~15 years since I last used BitTorrent personally, and I had asked a friend before replying and they swore that all their traffic was TCP -- though perhaps that may be due to CGNAT or something similar causing that fallback scenario you describe.
Thanks for the info, and sorry for jumping to a conclusion! Though my original point stands: Residential ISPs are generally not built to handle BitTorrent traffic flows (customer to customer or customer to other-ISP-customer across large geographic areas) so the bursty nature would cause congestion much easier, and BitTorrent itself isn't really made for these kinds of scenarios where content changes on a daily basis. CDNs exist for a reason, even if they're not readily available at reasonable prices for projects like OP!
The number of connections isn’t relevant. A single connection can cause the same problem with enough traffic. Your bandwidth is not allocated on a per-connection basis.
If you download 2 separate files over HTTP, you'd expect each to get roughly 1/2 of the available bandwidth at the bottleneck.
With 1 HTTP connection downloading a file and 100 BitTorrent connections trying to download a file, all trying to compete, you'll find the HTTP throughput significantly reduced. It's how congestion control algorithms are designed: rough fairness per connection. That's why the first edition of BBR that Google released was unpopular, it stomped on other traffic.
I had the same thoughts for some time now. It would be really nice to distribute software and containers this way. A lot of people have the same data locally and we could just share it.
This is a feature, not a bug. A torrent file / magnet link contains the hash of the data, which is immutable. Just publish a new link (you should anyway, even with HTTP).
That's useful if you did not prepare for that eventuality. But if you do want to store data long-term it's better to generate error-correcting codes like PAR2 so that you can actually recover partial errors without having to rely on seeders being available - and when you do that you no longer need the torrent for verification.
Imagine if for websites you had to get a brand new domain whenever you wanted to change the contents of your web page. You couldn't just go to google.com because it would be immutable. You would have to somehow know to go to google10393928.com, and any links to Google on the internet would point to some old version. The ability to have a link always refer to the latest version is useful. The same applies to torrents. It's possible for the magnet link to the latest version to get lost, and then a bunch of people are accidentally downloading old, worse versions of the file with no way to find the newest version.
I don't need to imagine; domains do exist. The OP brought in domains as an example.
For some you want a name where the underlying resource can change, for others you want a hash of the actual resource. Which one you want depends on the application.
> I don't need to imagine domains do not exist. They exist. OP brought in domains as an example.
In the context of webpages, a domain lets you deploy new versions.
With a torrent file, a domain does not let you do that.
Please try to understand the comparison they're making instead of just saying "domains are not hashes" "domains do exist".
> For some you want a name where the underlying resource can change, for others you want a hash of the actual resource. Which one you want depends on the application.
Right.
And torrents don't give you the choice.
Not having the choice is much closer to "bug" than "feature".
Needing a new magnet link is fine. The old magnet link working indefinitely is great. Having no way to get from old magnet to new magnet is not as fine.
The OP brought in domains as an example but domains are not applicable.
Everybody on HN should know how a domain works. I think most people on HN understand what a hash is and how a magnet link works. The fact that you can't easily replace the resource under a magnet link is a feature, not a bug. Think for a bit about the consequences of being able to easily replace the resources associated with a magnet link, rather than just about the 'convenience of being able to update a torrent', and you'll see that this is not a simple thing at all.
Torrents are simply a different thing than 'the web' and to try to equate the one to the other is about as silly as trying to say that you can't use a screwdriver to put nails in the wall. They're different things. Analogies are supposed to be useful, not a demonstration of your complete lack of understanding of the underlying material.
I distribute some software that I wrote using a torrent with a magnet link, so I'm well aware of the limitations there, but these limitations are exactly why I picked using a torrent in the first place.
It's a straw man because no one is telling you to replace domains with torrents. You'd replace the download link to https://yourwebsite.com/files/download-1.0.zip with https://yourwebsite.com/files/download-1.0.zip.torrent or the magnet URL corresponding to that file. Even if you wanted to replace HTTP with torrents entirely then the domain would be updated to point to the current torrent - after all the whole point of domains is that you have a memorable and persistent name that can be resolved to the current location of a service.
It's not all about how you distribute content. We must also decide which content to distribute, and that is a hard problem.
The most successful strategy so far has been moderation. Moderation requires hierarchical authority: a moderator who arbitrarily determines which data is and is not allowed to flow. Even bittorrent traffic is moderated in most cases.
For data to flow over bittorrent, two things must happen:
1. There must be one or more seeders ready to connect when the leecher starts their download.
2. There must be a way for a prospective leecher to find the torrent.
The best way to meet both of these needs is with a popular tracker. So here are the pertinent questions:
1. Is your content fit for a popular tracker? Will it get buried behind all the Disney movies and porn? Does it even belong to an explicit category?
If not, then you are probably going to end up running your own tracker. Does that just mean hosting a CDN with extra steps? Cloud storage is quite cheap, and the corporate consolidation of the internet by Cloudflare, Amazon, etc. has resulted in a network infrastructure that is optimized for that kind of traffic, not for bittorrent.
2. Is a popular tracker a good fit for your content? Will your prospective downloaders even think to look there? Will they be offended by the other content on that tracker, and leave?
Again, a no will lead to you making your own tracker. Even in the simplest case, will users even bother to click your magnet link, or will they just use the regular CDN download that they are used to?
So what about package repos? Personally, I think this would be a great fit, particularly for Nix, but it's important to be explicit about participation. Seeding is a bad default for many reasons, which means you still need a relatively reliable CDN/seed anyway.
---
The internet has grown into an incredibly hierarchical network, with incredibly powerful and authoritative participants. I would love to see a revolution in decentralized computing. All of the technical needs are met, but the sociopolitical needs need serious attention. Every attempt at decentralized content distribution I have seen has met the same fate: drowned in offensive and shallow content by those who are most immediately excited to be liberated from authority. Even if it technically works, it just smells too foul to use.
I propose a new strategy to replace moderation: curation. Instead of relying on authority to block out undesirable content, we should use attested curation to filter in desirable content.
Want to give people the option to browse an internet without porn? Clearly and publicly attest which content is porn. Don't light the shit on fire, just open the windows and let it air out.
People like Geofabrik are why we can (sometimes) have nice things, and I'm very thankful for them.
The level of irresponsibility/cluelessness you see from developers if you're hosting any kind of API is astonishing, so these downloads are not surprising at all... If someone had told me, a couple of years back, the things that I've now seen, I'd have absolutely dismissed them as making stuff up and grossly exaggerating...
However, by the same token, it's sometimes really surprising how rarely API developers think in terms of multiples of things - it's very often just endpoints for actions on single entities, even though the nature of the use case is almost never at that level - so you have no other way than to send 700 requests to do "one action".
> Level of irresponsibility/cluelessness you can see from developers if you're hosting any kind of an API is astonishing
This applies to anyone unskilled in a profession. I can assure you, we're not all out here hammering the shit out of any API we find.
With the accessibility of programming to just about anybody, and particularly now with "vibe-coding" it's going to happen.
Slap a 429 (Too Many Requests) in your response or something similar using a leaky-bucket algo and the junior dev/apprentice/vibe coder will soon learn what they're doing wrong.
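Roughly this sort of thing - a bare-bones sketch, with made-up rate/burst numbers and an in-memory dict standing in for whatever store you'd actually use:

```python
# Leaky-bucket style limiter: each client's bucket drains at a fixed rate;
# a request that would overflow the bucket gets a 429 instead of being served.
import time
from collections import defaultdict

RATE = 1.0     # requests per second that "leak" out of the bucket
BURST = 10.0   # bucket capacity, i.e. how big a burst we tolerate

_buckets = defaultdict(lambda: (0.0, time.monotonic()))   # client -> (level, last_seen)

def allow(client_id: str) -> bool:
    level, last = _buckets[client_id]
    now = time.monotonic()
    level = max(0.0, level - (now - last) * RATE)   # water drains while they were quiet
    if level + 1.0 > BURST:
        _buckets[client_id] = (level, now)
        return False                                # would overflow -> send a 429
    _buckets[client_id] = (level + 1.0, now)
    return True

# In the request handler:
#   if not allow(client_ip):
#       return a 429 response with a Retry-After header
```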
> Slap a 429 [...] will soon learn what they're doing wrong.
Oh how I wish this were true. We have customers sending tens to hundreds of requests per second, and they will complain if even just one gets a 429. As in, they escalate to their enterprise account rep. I always tell them to buy the customer an "error handling for dummies" book, but they never learn.
Another commenter said essentially the same thing, I sympathise, it's painful when the "customer" can't understand something clearly telling them they're doing it wrong.
I don't have an answer, but I wonder, for the sake of protecting your own org, whether some sort of "abuse" policy is the only approach; as in, deal with the human problem: be clear with them in writing somewhere that if they're seeing response X or Y (429 etc.) they're essentially abusing the API and need to behave.
The only thing that reliably works is to push the cost to the customer - so they can do whatever insanity they want, and they get charged accordingly.
And we’ve had cases where we had “fuckoff” charges in contracts (think $1 per API call after X thousand calls a day) and the customer just gladly pays tens of thousands of dollars and thanks us for fixing their problem.
The money is nice but sometimes you just want to shake them and say “look we have notifications you don’t need to poll the endpoint ten times a second, fuck it give us the code and we’ll fix it …”
Thanks for the reply - I did not mean to rant, but, unfortunately, this is in context of a B2B service, and the other side are most commonly IT teams of customers.
There are, of course, both very capable and professional people, and also kind people who are keen to react / learn, but we've also had situations where 429s result in complaints to their management how our API "doesn't work", "is unreliable" and then demanding refunds / threatening legal action etc...
One example was sending 1.3M update requests a day to manage state of ~60 entities, that have a total of 3 possible relevant state transitions - a humble expectation would be several requests/day to update batches of entities.
Not at all, I sympathise, we're all like minded people here!
> One example was sending 1.3M update requests a day to manage state of ~60 entities, that have a total of 3 possible relevant state transitions
> but we've also had situations where 429s result in complaints to their management how our API "doesn't work", "is unreliable" and then demanding refunds / threatening legal action etc
That's painful, and at this point we're beyond technical solutions; this needs human solutions. If they can't realise that they're being rate limited because they're basically abusing the API, they need to be told in no uncertain terms.
Of course I understand that it's not that simple, as a backend dev, my "customers" are usually other devs so I can be frank, but when dealing with B2B customers we often have to act like they're not in the wrong.
But that is a question that should be escalated to management right? If they charge the customer enough that allowing them to make 1.3M requests to update 60 entities makes sense, why not let them?
If they want the service stupidly overprovisioned to deal with these nutjobs, then that’s what we’ll do. I find that they’re generally receptive to the message of $10k per month to serve nutjobs, $100 per month to serve everyone else, though.
That’s the key - have that in place from the beginning.
Because many “enterprise” customers can spend literally millions doing shit the absolute wrong way, but have $0 budget for a developer to make it work right.
I don't understand why features like S3's "requester pays" aren't more widely used (and available outside AWS). Let the inefficient consumer bear their own cost.
Major downside is that this would exclude people without access to payment networks, but maybe you could still have a rate-limited free option.
Their download service does not require authentication, and they are kind enough to be hesitant about blocking IPs (one IP could be half of a university campus, for example). So that leaves chasing around to find an individual culprit and hoping they'll be nice and fix it.
They could, however, rate-limit per IP, possibly with a whitelist or higher limits for cooperative universities that are willing to prevent abuse from their own networks. Unless they are facing a DDoS - but then this appeal is even less likely to help.
Sounds like some people are downloading it in their CI pipelines. Probably unknowingly. This is why most services stopped allowing automated downloads for unauthenticated users.
Make people sign up if they want a url they can `curl` and then either block or charge users who download too much.
I'd consider CI one of the worst massive wastes of computing resources invented, although I don't see how map data would be subject to the same sort of abusive downloading as libraries or other code.
This stuff tends to happen by accident. Some org has an app that automatically downloads the dataset if it's missing, helpful for local development. Then it gets loaded in to CI, and no one notices that it's downloading that dataset every single CI run.
At some point wilful incompetence becomes malice. You really shouldn't allow network requests from your CI runners unless you have something that cannot be solved in another way (hint: you don't).
Let's say you're working on an app that incorporates some Italian place names or roads or something. It's easy to imagine how when you build the app, you want to download the Italian region data from geofabrik then process it to extract what you want into your app. You script it, you put the script in your CI...and here we are:
> Just the other day, one user has managed to download almost 10,000 copies of the italy-latest.osm.pbf file in 24 hours!
Whenever people complain about the energy usage of LLM training runs I wonder how this stacks up against the energy we waste by pointlessly redownloading/recompiling things (even large things) all the time in CI runs.
Optimising CI pipelines has been a strong aspect of my career so far.
Anybody can build a pipeline to get a task done (thousands of quick & shallow howto blog posts) but doing this efficiently so it becomes a flywheel rather than a blocker for teams is the hard part.
Not just caching but optimising job execution order and downstream dependencies too.
The faster it fails, the faster the developer feedback, and the faster a fix can be introduced.
I quite enjoy the work and always learning new techniques to squeeze extra performance or save time.
This is exactly it - you can cache all the wrong things easily, cache only the code you wanted changed, or cache nothing but one small critical file nobody knows about.
No wonder many just turn caching entirely off at some point and never turn it back on.
There is really no reason to add a JS dependency for this - whatever server-side component expires old URLs can just as well update the download page with the new one.
Some years ago I thought, no one would be stupid enough to download 100+ megabytes in their build script (which runs on CI whenever you push a commit).
It does if you are building on the same host with preserved state and didn't clean it.
There are lots of cases where people end up with an empty docker repo at every CI run, or regularly empty the repo because docker doesn't have any sort of intelligent space management (like LRU).
To get fine-grained caching you need to use cache-mounts, not just cache layers. But the cache export doesn't include cache mounts, therefore the docker github action doesn't export cache mounts to the CI cache.
"There have been individual clients downloading the exact same 20-GB file 100s of times per day, for several days in a row. (Just the other day, one user has managed to download almost 10,000 copies of the italy-latest.osm.pbf file in 24 hours!) Others download every single file we have on the server, every day."
This sounds like a problem rate-limiting would easily solve. What am I missing? The page claims almost 10,000 copies of the same file were downloaded by the same user.
The server operator is able to count the number of downloads in a 24h period for an individual user but cannot or will not set a rate limit.
Why not?
Will the users mentioned above (a) read the operator's message on this web page and then (b) change their behaviour?
I would bet against (a), and therefore (b) as well.
Geofabrik guy here. You are right - rate limiting is the way to go. It is not trivial though. We use an array of Squid proxies to serve stuff, and Squid's built-in rate limiting only does IPv4. While most over-use comes from IPv4 clients, it somehow feels stupid to rate-limit IPv4 and leave IPv6 wide open. What's more, such rate limiting would always just be per-server which, again, somehow feels wrong when what one would really want is to limit the sum of traffic for one client across all proxies... then again, maybe we'll go for the stupid IPv4-per-server limit only, since we're not up against some clever form of attack here but just against carelessness.
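For what it's worth, the "treat a whole IPv6 /64 as one client" trick is one way to make the two families comparable - just a sketch, not something we actually run:

```python
# Collapse each client address to a rate-limit key: the single address for
# IPv4, the /64 for IPv6 (a /64 is typically one subscriber). The per-key
# counters would then live in something shared by all proxies (a small Redis,
# say) rather than per-server, so the limit covers the client's total traffic.
import ipaddress

def limit_key(addr: str) -> str:
    ip = ipaddress.ip_address(addr)
    if ip.version == 4:
        return str(ip)                                          # one key per IPv4 address
    return str(ipaddress.ip_network(f"{ip}/64", strict=False))  # one key per IPv6 /64

assert limit_key("203.0.113.7") == "203.0.113.7"
assert limit_key("2001:db8::1") == limit_key("2001:db8::2:3")   # same /64, same bucket
```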
Oh hey, it's me, the dude downloading italy-latest every 8 seconds!
Maybe not, but I can't help but wonder if anybody on my team (I work for an Italian startup that leverages GeoFabrik quite a bit) might have been a bit too trigger happy with some containerisation experiments. I think we got banned from geofabrik a while ago, and to this day I have no clue what caused the ban; I'd love to be able to understand what it was in order to avoid it in the future.
I've tried calling and e-mailing the contacts listed on geofabrik.de, to no avail. If anybody knows of another way to talk to them and get the ban sorted out, and ideally find out what on our side triggered it, please let me know.
Hey there dude downloading italy-latest every 8 seconds, nice to hear from you. I don't think I saw an email from you at info@geofabrik, could you re-try?
Do they email heavy users? We used the free Nominatim API for geocoding addresses in 2012 and our email was a required parameter. They emailed us and asked us to cache results to reduce request rates.
>Just the other day, one user has managed to download almost 10,000 copies of the italy-latest.osm.pbf file in 24 hours!
Whenever I have done something like that, it's usually because I'm writing a script that goes something like:
1. Download file
2. Unzip file
3. Process file
I'm working on step 3, but I keep running the whole script because I haven't yet built a way to just do step 3.
I've never done anything quite that egregious though. And these days I tend to be better at avoiding this situation, though I still commit smaller versions of this crime.
My solution to this is to only download if the file doesn't exist. An additional bonus is that the script now runs much faster because it doesn't need to do any expensive networking/downloads.
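In Python the crude version is just a guard around the download; the URL and filename here are only placeholders:

    import os
    import urllib.request

    URL = "https://download.geofabrik.de/europe/italy-latest.osm.pbf"  # placeholder URL
    LOCAL = "italy-latest.osm.pbf"

    def fetch_once(url=URL, path=LOCAL):
        """Download only if we don't already have a local copy."""
        if not os.path.exists(path):
            tmp = path + ".part"
            urllib.request.urlretrieve(url, tmp)  # download under a temporary name
            os.rename(tmp, path)  # so a failed run doesn't leave a "complete" file behind
        return path

Steps 2 and 3 can then be re-run against fetch_once() as often as you like without hitting the server again.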
10,000 times a day is on average once every 8 or 9 seconds. No way someone is iterating on a fix that fast; this is more like someone who wanted to download a new copy every day, or every hour but messed up a milliseconds config or something. Or it's simply a malicious user.
Meta: the title could be clearer, e.g. "Download OSM Data Responsibly" would have helped me figure out the context faster as someone not familiar with the domain name shown.
I think the original title is fine; in this case whether it's OSM or not doesn't really matter, as it applies to many other instances of people downloading data in a loop for no good reason.
But that raises the complexity of hosting this data immensely. From a file + nginx you now need active authentication, issuing keys, monitoring, rate limiting...
Yes, this is the "right" solution, but it is a huge pain and it would be nice if we could have nice things without needing to do all of this work.
Speaking as the person running it - introducing API keys would not be a big deal, we do this for a couple paid services already. But speaking as a person frequently wanting to download free stuff from somewhere, I absolutely hate having to "set up an account" just to download something once. I started that server well over a decade ago (long before I started the business that now houses it); the goal has always been first and foremost to make access to OSM data as straightforward as possible. I fear that having to register would deter many a legitimate user.
Yeah, I totally get it. In an ideal world we could just stick a file on an HTTP server and people would download it reasonably. Everything is simpler and happier this way.
Chances of those few people doing very large amounts of downloads reading this are quite small. Basic rate limits on IP level or some other simple fingerprinting would do a lot of good for those edge cases, as those folks are most likely not aware of this happening.
Yes, IP rate limiting is not perfect, but if they have a way of identifying which user is downloading the whole planet every single day, that same user can be throttled as well.
But rate-limiting public data is a huge pain. You can't really just have a static file anymore. Maybe you can configure your HTTP server to do IP-based rate limiting, but that is either ineffective (e.g. public clouds where the downloader gets a new IP every time) or hits bystanders (reasonable downloads that egress from the same IP or net block).
So if you really want to do this you need to add API keys and authentication (even if it is free) to reliably track users.
Even then you will have some users that find it easier to randomly pick from 100 API keys rather than properly cache the data.
I continue to be baffled that the Geofabrik folks remain the primary way to get a clean-ish OSM shapefile. Big XKCD "that one bloke holding up the internet" energy.
Shapefiles shouldn't be what you're after, Parquet can almost always do a better job unless you need to either edit something or use really advanced geometry not yet supported in Parquet.
>one user has managed to download almost 10,000 copies of the italy-latest.osm.pbf file in 24 hours!
Trash code most likely. Rate limit their IP. No one cares about people that do this kind of thing. If they're a VPN provider...then still rate limit them.
Ah, responsibility... The one thing we hate teaching and hate learning even more. Someone is probably downloading files in some automated pipeline. Nobody taught them that with great power (being able to write programs and run them on the internet) comes great responsibility. It's similar to how people drive while intoxicated or on the phone etc. It's all fun until you realise you have a responsibility.
I mean, at this point I wouldn't mind if they rate-limit downloads. A _single_ customer downloading the same file 10.000 times? Sorry, we need to provide for everyone, try again at some other point.
It is free, yes, but there is no need to abuse it, and they are under no obligation to give away as much resource for free as they can.
See: "Also, when we block an IP range for abuse, innocent third parties can be affected."
Although they refer to IP ranges, the same principle applies on a smaller scale to a single IP address: (1) dynamic IP addresses get reallocated, and (2) entire buildings (universities, libraries, hotels, etc.) might share a single IP address.
Aside from accidentally affecting innocent users, you also open up the possibility of a DOS attack: the attacker just has to abuse the service from an IP address that he wants to deny access to.
More sophisticated client identification can be used to avoid that edge case, e.g. TLS fingerprints. They can be spoofed as well, but if the client is going through that much trouble, then they should be treated as hostile. In reality it's more likely that someone is doing this without realizing the impact they're having.
It could be slightly more sophisticated than that. Instead of outright blocking an entire IP range, set quotas for individual clients and throttle downloads exponentially. Add latency, cap the bandwidth, etc. Whoever is downloading 10,000 copies of the same file in 24 hours will notice when their 10th attempt slows down to a crawl.
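A rough sketch of what I mean, with made-up numbers: count repeat downloads per client and file, then sleep for an exponentially growing delay before serving.

    import time
    from collections import defaultdict

    FREE_DOWNLOADS = 3       # first few downloads run at full speed
    BASE_DELAY_SECONDS = 1   # then 1s, 2s, 4s, 8s, ... before the transfer starts
    MAX_DELAY_SECONDS = 600

    recent_downloads = defaultdict(int)  # (client, filename) -> count in the current window

    def throttle(client, filename):
        """Delay repeat downloads of the same file by the same client, exponentially."""
        recent_downloads[(client, filename)] += 1
        n = recent_downloads[(client, filename)]
        if n <= FREE_DOWNLOADS:
            return
        delay = min(MAX_DELAY_SECONDS, BASE_DELAY_SECONDS * 2 ** (n - FREE_DOWNLOADS - 1))
        time.sleep(delay)

A real server would expire the counters and cap bandwidth instead of blocking a worker thread, but the escalation curve is the point.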
It'll still suck for CI users. What you'll find is that occasionally someone else on the same CI server will have recently downloaded the file several times and when your job runs, your download will go slowly and you'll hit the CI server timeout.
It's not unreasonable for each customer to maintain a separate cache (for security reasons), so that each of them will download the file once.
Then it only takes one bad user on the same subnet to ruin the experience for everyone else. That sucks, and isn't working as intended, because the intent was to only punish the one abusive user.
Why put a load of money into infra and none into simple mitigations like rate limiting to prevent the kind of issues they are complaining they need the new infra for?
Seems to me like the better action would be to implement rate-limiting, rather than complain when people use your resource in ways you don't expect. This is a solved problem.
AI models are trained relatively rarely, so it's unlikely this would be very noticeable among all the regular traffic. Just the occasional download-everything every few months.
One would think so. If AI bros were sensible, responsible and intelligent.
However, the practical evidence is to the contrary: AI companies are hammering every webserver out there, ignoring any kind of convention like robots.txt, re-downloading everything at pointlessly short intervals, annoying everyone and killing services.
Sounds like someone has an internal deployment script with the origin mirror in the update URI. This kind of hammering is also common for folks prototyping container build recipes, so don't assume the person is necessarily intending anyone harm.
Unfortunately, the long sessions needed for file servers do make them an easy target for DoS, especially if they support quality-of-life features like fast-forward/seek in media. Thus, a large file server normally does not share a host with a nimble website or API server (CDNs still exist for a reason).
Free download API keys with a set data/time quota and an IP rate limit are almost always necessary (e.g. the connection slows down to 1 kB/s after a set daily limit). Ask anyone who runs a tile server or media platform about the connection-count costs of free resources.
It is a trade-off, but better than the email-me-a-link solution some firms deploy. Have a wonderful day =3
> If you want a large region (like Europe or North America) updated daily, use the excellent pyosmium-up-to-date program
This is why naming matters. I never would have guessed in a million years that software named "pyosmium-up-to-date" does this. Give it a better name and more people will use it.
Whenever I read about such issues I always wonder why we all don’t make more use of BitTorrent. Why is it not the underlying protocol for much more stuff? Like container registries? Package repos, etc.
I can imagine a few things:
1. BitTorrent has a bad rep. Most people still associate it with just illegal download.
2. It requires slightly more complex firewall rules, and asking the network admin to put them in place might raise some eyebrow for reason 1. On very restrictive network, they might not want to allow them at all due to the fact that it opens the door for, well, BitTorrent.
3. A BitTorrent client is more complicated than an HTTP client, and not installed on most company computer / ci pipeline (for lack of need, and again reason 1.). A lot of people just want to `curl` and be done with it.
4. A lot of people think they are required to seed, and for some reason that scare the hell of them.
Overall, I think it is mostly 1 and the fact that you can just simply `curl` stuff and have everything working. I do sadden me that people do not understand how good of a file transfer protocol BT is and how it is underused. I do remember some video game client using BT for updates under the hood, and peertube use webtorrent, but BT is sadly not very popular.
At a previous job, I was downloading daily legal torrent data when IT flagged me. The IT admin, eager to catch me doing something wrong, burst into the room shouting with management in tow. I had to calmly explain the situation, as management assumed all torrenting was illegal and there had been previous legal issues with an intern pirating movies. Fortunately, other colleagues backed me up.
Hey, ages ago, as an intern, I have been flagged for BitTorrent downloads. As it turned out, I was downloading/sharing Ubuntu isos, so things didn't escalate too far, but it was a scary moment.
So, I'm not using BT at work anymore.
I left a Linux ISO (possibly Ubuntu) seeding on a lab computer at university, and forgot about it after I'd burned the DVD. You can see this was a while ago.
A month later an IT admin came to ask what I might be doing with port 6881. Once I remembered, we went to the tracker's website and saw "imperial.ac.uk" had the top position for seeding, by far.
The admin said to leave running.
> The admin said to leave running.
This can be read in two wildly different ways.
Fortunately, it was the nice way — that university is one of the backbone nodes of the UK academic network, so the bandwidth use was pretty much irrelevant.
S3 had BitTorrent support for a long time...
"S3 quietly deprecates BitTorrent support" - https://news.ycombinator.com/item?id=27524549
At least the planet download offers BitTorrent. https://planet.openstreetmap.org/
So does Wikipedia https://meta.m.wikimedia.org/wiki/Data_dump_torrents
Truly the last two open web titans.
> A lot of people think they are required to seed, and for some reason that scare the hell of them.
Some of the reasons consist of lawyers sending out costly cease and desist letters even to "legitimate" users.
For seeding map data?
Particularly if your torrent traffic is encrypted, they don't always bother to check what you are torrenting
Are you just theorizing, or is there precedent of this? I mean of lawyers sending cease and desist letters to people torrenting random encrypted streams of data?
When I was working in p2p research, we used to get regular C&Ds just for scraping the tracker (not even downloading anything!). IP holders be wilding
Appreciate your response, madness indeed!
Do companies send out C&Ds for your average user torrenting? I've gotten thousands of DMCA letters but never a C&D, and I've only ever heard of 1 person getting one, and they were silly enough to be hosting a collection of paid content that they scraped themselves, from their home.
DMCA demands are, as far as I'm aware, completely automated and couldn't really cost much.
Bad rep ...
You know what has a bad rep? Big companies that use and trade my personal information like they own it. I'll start caring about copyrights when governments force these big companies to care about my information.
Lol, bad rep? Interesting, in my country everybody is using it to download movies :D Even more so now, after this botched streaming war. (EU)
To play devil's advocate, I think the author of the message was talking about the corporate context where it's not possible to install a torrent client; Microsoft Defender will even remove it as a "potentially unwanted program", precisely because it is mostly used to download illegal content.
Obviously illegal ≠ immoral, and being a free-software/libre advocate opposed to copyright, I am in favor of the free sharing of humanity's knowledge, and therefore supportive of piracy, but that doesn't change the perception in a corporate environment.
What? Transmission never triggers any warning from Defender.
That will depend on how Defender is configured - in a corporate environment it may be set to be far more strict. In fact tools other than Defender are likely to be used, but these often get conflated with Defender in general discussions.
Wow, that's vile. I have many objections to this but they all boil down to M$ telling you what you cannot do with your own computer.
Most people want Microsoft preventing them from installing malware on their own computer.
Most software coming out of SV these days is malware by the original definition yet for some reason never gets blocked.
But it isn't malware by any stretch of the imagination.
There are various common malware payloads that include data transfer tools (http proxies, bittorrent clients, etc.) - it isn't just password scanners, keyboard monitors, and crypto miners. These tools can be used for the transfer of further malware payloads, to create a mesh network so more directed hacking attempts are much more difficult to track, to host illegal or immoral content, or for the speedy exfiltration of data after a successful directed hack (perhaps a spear-phish).
Your use of the stuff might not be at all malware-like, but in a corporate environment, if it isn't needed it gets flagged as something to be checked up on in case it is not there for a good reason. I've been flagged for some of the tools I've played with, and this is fine: I have legitimate use for that sort of thing in my dealings with infrastructure, so there are flags ticked that say “Dave has good reason to have these tools installed, don't bother us about it again unless he fails to install security updates that are released for them”. I want those things flagged in case people who won't be doing the things I do end up with such stuff installed without their knowledge, so it can be dealt with (and they can be given more compulsory “don't just thoughtlessly click on every link in any email you receive, and carelessly type your credentials into resulting forms” training!).
Which is exactly why it has a bad rep. In most people's minds, BitTorrent = illegal download.
downloading movies for personal use is legal in many countries.
Not in any country that is part of the big international IP agreements (Berne convention, Paris Act).
The only exception (sort of) is Switzerland. And the reason downloading copyrighted content you haven't bought for personal use is legal in Switzerland is because the government is essentially paying for it - there is a tax in Switzerland on empty media, the proceeds from which are distributed to copyright holders whose content is consumed in Switzerland, regardless of whether it is bought directly from the rights holder or otherwise.
Apparently the legal status of downloading copyrighted materials for personal use is also murky in Spain, where at least one judge found that it is legal - but I don't know how solid the reasoning was or whether other judges would agree (Spain being a civil law country, legal precedent is not binding there to the same extent that it would be in the UK or USA).
> Not in any country that is part of the big international IP agreements (Berne convention, Paris Act).
Poland signed the Berne Convention in 1919 and has "well regulated" copyright, but downloading all media (except software) for personal use is still fully legal. A tax on "empty media" is in place as well.
No it isn't.
Format shifting and personal copying are legal in Poland, but you as an individual still have to have legally obtained your original in the first place to exercise that right, and an illicit download certainly doesn't count. Taxing "empty media" is to compensate for those format-shifting rights, but it doesn't cover remuneration for acquiring media in the first place (and indeed no EU member state could operate such a scheme - they are prohibited by EU Directive 2001/29 https://eur-lex.europa.eu/legal-content/EN/ALL/?uri=celex%3A...).
> Format shifting and personal copying are legal in Poland, but you as an individual still have to have legally obtained your original in the first place to exercise that right, and an illicit download certainly doesn't count.
Like everywhere else where personal copies are legal and you can download them. If both conditions are true, then the mere fact that you are downloading something is not a sign that you are downloading pirated content.
OTOH there is also Spain where piracy with no direct monetary gain is tolerated and nobody goes after people torrenting.
Yes it is. As long as the content was intentionally distributed by the rights holder (for example, a movie had its premiere) you can legally download it from anywhere and use it for your own (and your friends') enjoyment however you please. You can't make it (or even the content you bought) available to people who aren't your actual friends (random people on the internet).
That's the Polish law, both the letter and the implementation. On at least one occasion the police issued an official statement saying exactly that.
I think no one was ever fined in Poland for incidental upload while using the BitTorrent protocol to download. There are high-profile cases against people who were publishing large amounts of media files, especially commercially. A little more than a decade ago there was one case where some company tried to go after BitTorrent downloaders of 3 specific Polish movies. But I think it was ultimately thrown out or cheaply settled, because no case like that has been publicized ever since and everybody who knows how to use BitTorrent does.
Again, it covers everything except software, which has more restrictive laws, closer to what you think the law is.
The tax on empties was set up a long time ago to support creators whose music is shared among friends directly. It was not intended to compensate for downloads. I think only Polish artists receive any money from this (I might be wrong on that), and the organization that distributes the money is highly inefficient. They tried to extend the tax to electronic devices, but nobody likes them, companies and people both, so they haven't gotten far with this proposal for now.
Poland enjoys a lot of digital freedoms, is conscious of them, and is ready to defend them against ACTA and Chat Control and to extend them with Stop Killing Games.
I'll tldr your reply: "yes, there are countries where it's legal to download for personal use".
Thank you.
There is also the US. It is legal to download movies in the United States. You can, however, get dinged by the automated lawsuit or complaint bots for uploading them, which makes torrenting without a vpn less than ideal.
> It is legal to download movies in the United States.
No it isn't.
It's not a criminal offense, but if someone can sue you for it and win then it isn't "legal" under any technical or popular definition of the word.
Finding examples of people getting successfully sued for downloading or viewing movies without sharing them should be trivial, then.
Otherwise, absent any examples of enforcement or successful legal action, downloading movies is illegal in the US in the same way that blasphemy is illegal in Michigan.
https://www.legislature.mi.gov/Laws/MCL?objectName=MCL-750-1...
This is what I heard and experienced. Is the guy above just making shit up?
It's absolutely illegal in the USA to download a movie (or game, or song, or book, etc) that you haven't bought [0].
It could be argued that if you bought a movie, say on DVD, downloading another copy of it from an online source could fall under fair use, but this is more debatable.
[0] https://legalclarity.org/is-pirating-movies-illegal-what-are...
I am not aware of a single example of someone getting successfully sued for downloading a movie. In every lawsuit that I'm aware of (even going back to the Napster days), people got sued for sharing content using p2p software. The current lawsuit robots download a torrent and then wait for individual IPs to upload some chunk of a copyrighted file to them, which they then use as proof of somebody sharing copyrighted material for their complaint.
Even the Protecting Lawful Streaming Act of 2020 explicitly does not punish consumers of copyrighted content, only its distributors.
>Tillis stated that the bill is tailored to specifically target the websites themselves, and not "those who may use the sites nor those individuals who access pirated streams or unwittingly stream unauthorized copies of copyrighted works"
There are so many paragraphs in response to my “You can’t get in trouble for downloading movies in the US” post and none of them have any examples of people getting in trouble for downloading movies in the US.
This is a useless discussion. Imagine how the firewall-guy/network-team in your company will react to that argument.
This is not a useless discussion just because it'll inconvenience someone who is at work anyway.
How about the uploading part of it, which is behind the magic of BitTorrent and is its default mode of operation?
Download, yes, but using BitTorrent means you are sharing, and that's not allowed in most countries even if downloading is.
Really?? Which countries allow copyright infringement by individuals?
None. Because you projected your country's laws in the discussion, you failed to see that the countries that allow copyrighted material to be downloaded for personal usage do not qualify that download as "copyright infringement" in the first place.
To answer your question with the only answer I know: Switzerland.
See above: there are a few, but it's not copyright infringement.
How is downloading a movie copyright infringement?
A download is a copy of a work. So, downloading a movie is making a copy of a work that you are not a copyright holder of - in other words, either you or the site you are downloading from are infringing on the copyright holder's exclusive right to create copies of their work. You could claim there is some fair use exemption for this case, or you can have an alternative way of authorizing copies and paying for them like Switzerland does, but there is no doubt in any legal system that downloading is the same kind of action as copying a book at a print shop.
I love how enthusiastic this post is while being wrong.
Making a copy of a thing does not violate copyright (eg you can photocopy a book that you possess even temporarily). Sharing a copy that you made can violate copyright.
It is like mixing up “it’s illegal to poison somebody with bleach” and “it’s illegal to own bleach”. The action you take makes a big difference
Also, as an aside, when you view a legitimately-purchased and downloaded video file that you have license to watch, the video player you use makes a copy from the disk to memory.
If I own a license to listen to Metallica - Enter Sandman.m4a that I bought on iTunes and in the download folder I screw up and I make
Metallica - Enter Sandman(1).m4a
Metallica - Enter Sandman(2).m4a
Metallica - Enter Sandman(3).m4a
How much money do I owe Lars Ulrich for doing that based on The Law of The Earth Everywhere But Switzerland?
You're mixing up several things, all of which actually boil down to the fair use exceptions I was mentioning.
Making copies of a book you legally own for personal use is an established fair use exception to copyright. However, making copies of a book that you borrowed from a library would be copyright infringement. Similarly, lending the copies you've made of a book to friends would technically void the fair use exception for your copies.
The copy that a playback device has to make of a copyrighted audio/video file for its basic functioning is typically mentioned explicitly in the license you buy, thus being an authorized copy for a specific purpose. If you make several copies of a file on your own system for personal use, then again you are likely within fair use exemptions, similar to the book-copying case - though this is often complicated a bit legally by the fact that you don't own a copy but a license to use the work in various ways, and some companies' licenses can theoretically prohibit even archival copies, which in turn may or may not be legal in various jurisdictions.
But in no jurisdiction is it legal to, for example, go with a portable photocopy machine into a bookstore and make copies of books you find in there, even if they are only for personal use: you first have to legally acquire an authorized copy from the rights holder. All other exemptions apply to what you do with that legally obtained copy.
This even means that you don't have any rights to use a fraudulent copy of a work, even if you legitimately believed you were obtaining a legal copy. For example, say a library legally bought a book from a shady bookstore that, unbeknownst to them, was selling counterfeit copies of a book. If the copyright holder finds out, they can legally force the library to pay them to continue offering this book, or to destroy it otherwise, along with any archival copies that they had made of this book. The library can of course seek to obtain reparations from the store that sold them the illegal copy, but they can't refuse to pay the legal copyright holder.
> I love how enthusiastic this post is while being wrong.
This is a very funny thing to say given that post is entirely correct, while you are wrong.
> Making a copy of a thing does not violate copyright
Yes it does, unless it's permitted under a designated copyright exemption by local law. For instance, you mention that the video player makes a copy from disk to memory; well, that is explicitly permitted by Article 5(1) of the Copyright Directive 2001 in the EU as a use that is "temporary, transient or incidental and an integral and essential part of a technological process", as otherwise it would be illegal: by default, any act of copying is a breach of copyright. That's literally where the word comes from.
> If I own a license to listen to Metallica - Enter Sandman.m4a that I bought on iTunes and in the download folder I screw up and I make
> Metallica - Enter Sandman(1).m4a
> Metallica - Enter Sandman(2).m4a
> Metallica - Enter Sandman(3).m4a
In legal terms you do indeed owe him something, yes. It would probably be covered under the private copy exemptions in some EU territories, but only on the basis that blank media is taxed to pay rightsholders a royalty for these actions under the relevant collective management associations.
I got billed 1200€ for downloading 2 movies when I was 15. I will never use torrents again.
When injustice slaps you, you should do more of that, not less, but protect yourself (VPN, Tor, etc.).
Tell that to my 15-year-old self back in the day. I didn't even know that by torrenting I'd also seed, which was the part they got me for.
I assume this is Germany. Usually you can haggle it down to the low hundreds if it's the first time and you show you're just a regular young person with not much income.
Yea it was. Was 15 years ago though. Never looked back at it, I just won't ever use torrents again so I'll never face this issue again.
You mean some asshole asked your parents for that sum to not go to a trial that they would lose and your parents paid.
First off, it was like 2 months after my father's death and we didn't have time for this; secondly, my mom got an attorney that I paid. It was roughly the same amount though. We never paid them.
> It requires slightly more complex firewall rules, and asking the network admin to put them in place might raise some eyebrow for reason 1
Well, in many such situations data is provided for free, putting a huge burden on the other side. Even if it's a little bit less convenient, it makes the service a lot more sustainable. I imagine torrents for the free tier and direct download as a premium option would work perfectly.
I wish... I overpay more than double market value for my connection, and am not allowed to configure my router. This is the norm for most apartment dwellers in the US as far as I'm aware.
VPN is a thing
Sure, but I would also like to be able to forward ports. Right now, my best workaround is tailnet, which requires a client.
5. Most residential internet connections are woefully underprovisioned for upload, so anything that uses more of it (and yes, you need people to seed for BitTorrent to make sense) can slow down the entire connection.
6. Service providers have little control over the service level of seeders and thus the user experience. And that's before you get malicious users.
seeding is uploading after you are done downloading.
but you are already uploading while you are still downloading. and that can't be turned off. if seeding scares someone, then uploading should scare them too. so they are right, because they are required to upload.
If you have been in the scene long enough, you would know that there are some uncooperative clients that always send 0% (Xunlei being one of the more notorious examples with their VIP schemes; later on they would straight up spoof their client string when people started blocking them). Being a leecher nowadays is almost the norm for a lot of users, and I don't blame them, since they are afraid of consequences in more regulated jurisdictions. But a must-seed-when-you-leech requirement? Hoho no, that's more like a suggestion.
> Hoho no, that's more like a suggestion.
For public trackers maybe.
> but you are already uploading while you are still downloading. and that can't be turned off
Almost every client lets you set an upload limit, which you can set to 0. The only thing that generates upload bandwidth that cannot be deactivated would be protocol overhead (but you can disable parts of BT, like the DHT).
I think it should be up to the client to decide whether they want to seed. As another commenter mentioned, it could be for legal reasons. Perhaps downloading in that jurisdiction is legal but uploading is not. Perhaps their upload traffic is more expensive.
Now, as a seeder, you may still be interested in those clients being able to download and reach whatever information you are seeding.
In the same vein, as a seeder, you may just not serve those clients. That's kind of the beauty of it. I understand that there may be some old-school/cultural "code of conduct", but really this is not a problem with a behavioral solution but one with a technical solution that happens to be already built in.
> I think it should be up to the client to decide whether they want to seed
well, yes and no. legal issues aside (think about using bittorrent only for legal stuff), the whole point of bittorrent is that it works best if everyone uploads.
actually, allowing clients to disable uploading is almost an acknowledgement that illegal uses should be supported, because there are few reasons why legal uses should need to disable uploading.
and as an uploader i also don't want others not to upload. so while disabling upload is technically possible, it is also reasonable and not unlikely that connections from such clients could be rejected.
In the US, data caps are one reason to be stingy about seeding even if legality is not an issue. In that case though the user could still do something like limit the upload bandwidth while still seeding long-term, so their contribution is to provide availability and prevent a situation where no full seeds exist.
Some BitTorrent clients make it easier to set than others, but if it's a healthy torrent I often limit the upload rate to something so slow that it doesn't transfer anything up. Ratio is 0.00 and I still get 100s of mb/s.
Transmission allows turning this off by setting upload to 0. It's simply a client setting, but most clients don't offer it.
WebTorrent exists. It uses WebRTC to let users connect to each other. There's support in popular trackers.
This basically handles every problem stated. There's nothing to install on computers: it's just JS running on the page. There are no firewall rules or port forwarding to set up; it's all handled by STUN/TURN in WebRTC. Users wouldn't necessarily even be aware they are uploading.
STUN is not always possible, and TURN means proxying the connection through a server, which would be counter-productive for the purpose of using BitTorrent as an alternative to direct HTTP downloads, as you are now paying for the bandwidth in both directions. This is very much not a problem with magic solutions.
Agreed! But STUN's success rate is pretty good! As the number of peers goes up it should be less likely that one would need to use TURN to connect to a peer, but I am skeptical webrtc is built to fall back like this, to try other peers first.
The advantage is that at least it's all builtin. It's not a magic solution, but it's a pretty good solution, with fallbacks builtin for when the networking gets in the way of the magic.
Amazon, Esri, Grab, Hyundai, Meta, Microsoft, Precisely, Tripadvisor and TomTom, along with 10s of other businesses got together and offer OSM data in Parquet on S3 free of charge. You can query it surgically and run analytics on it needing only MBs of bandwidth on what is a multi-TB dataset at this point. https://tech.marksblogg.com/overture-dec-2024-update.html
If you're using ArcGIS Pro, use this plugin: https://tech.marksblogg.com/overture-maps-esri-arcgis-pro.ht...
It's just great that bounding box queries can be translated into HTTP range requests.
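For anyone who wants to try it, this is roughly what a bounding-box pull looks like from Python with DuckDB. Treat the S3 path and the bbox column names below as assumptions to check against the current Overture release notes (the field names have changed between releases); the filter on the bbox columns is what lets DuckDB turn the query into range requests for the relevant row groups instead of downloading the whole dataset.

    import duckdb

    con = duckdb.connect()
    con.sql("INSTALL httpfs")
    con.sql("LOAD httpfs")
    con.sql("SET s3_region='us-west-2'")  # anonymous access config may differ by DuckDB version

    # Placeholder: substitute the actual Overture release prefix and theme/type.
    path = "s3://overturemaps-us-west-2/release/<RELEASE>/theme=places/type=place/*"

    # Rough bounding box around Rome; only row groups overlapping it get fetched.
    rows = con.sql(f"""
        SELECT id, names, geometry
        FROM read_parquet('{path}', hive_partitioning=1)
        WHERE bbox.xmin > 12.3 AND bbox.xmax < 12.7
          AND bbox.ymin > 41.7 AND bbox.ymax < 42.0
    """).fetchall()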
As someone who works with mapping data for HGV routing, I've been keeping an eye on Overture. Do you know if anyone has compared data coverage and quality between this and proprietary datasets like HERE Maps? Does Overture supplement OSM road attributes (such as max height restrictions) where they can find better data from other sources?
I haven't done any deep dives into their road data but there was ~80 GB of it, mostly from TomTom, in the August release. I think the big question would be how much overlap there is with HERE and how the metadata would compare.
TomTom did a few write ups on their contribution, this one is from 2023: https://www.tomtom.com/newsroom/behind-the-map/how-tomtom-ma...
If you have QGIS running, I did a walkthrough using the GeoParquet Downloader Plugin with the 2.75B Building dataset TUM released a few weeks ago. It can take any bounding box you have your workspace centred on and download the latest transport layers for Overture. No need for a custom URL, as it's one of the default data sources the plugin ships with. https://tech.marksblogg.com/building-footprints-gba.html
Thanks for the response. There must be value in benchmarking data coverage and quality for routing data such as speed limits, vehicle restrictions, hazardous cargo, etc. I guess the problem is what to benchmark against.
Overture is not just "OSM data in Parquet".
Thanks for blatantly marketing Overture on a thread about downloading OSM data.
I remember seeing the concept of "torrents with dynamic content" a few years ago, but it apparently never became a thing [1]. I kind of wish it had, but I don't know if there are critical problems (i.e. security?).
[1]: https://www.bittorrent.org/beps/bep_0046.html
I assume it’s simply the lack of the inbuilt “universal client” that http enjoys, or that devs tend to have with ssh/scp. Not that such a client (even an automated/scripted CLI client) would be so difficult to setup, but then trackers are also necessary, and then the tooling for maintaining it all. Intuitively, none of this sounds impossible, or even necessarily that difficult apart from a few tricky spots.
I think it’s more a matter of how large the demand is for frequent downloads of very large files/sets, which leads to questions of reliability and seeding volume, all versus the effort involved to develop the tooling and integrate it with various RCS and file syncing services.
Would something like Git LFS help here? I’m at the limit of my understanding for this.
I certainly take advantage of BitTorrent mirrors for downloading Debian ISOs, as they are generally MUCH faster.
All Linux ISOs collectors in the world wholeheartedly agree.
Are you serious? Most Debian ISO mirrors I've used have 10gig connectivity and usually push a gigabit or two fairly easily. BitTorrent is generally a lot slower than that (it's a pretty terrible protocol for connecting you to actually fast peers and getting stuff quickly from them).
I've definitely seen higher speeds with BitTorrent, pretty easily maxing out my gbe nics, but I'm not downloading Debian images specifically with much frequency.
Trackers haven't been necessary for well over a decade now thanks to DHT.
I used to work at a company that had to deliver huge files to every developer every week. At some point they switched from a thundering herd of rsyncs to using BitTorrent. The speed gains were massive.
Our previous cluster management software used Bittorrent for distributing application images.
It took maybe 10 seconds longer for the downloads to start, but they then ran almost as fast as the master could upload one copy.
World of Warcraft used a BitTorrent-like protocol for patches for awhile, as a default option if I remember right. https://www.bluetracker.gg/wow/topic/us-en/10043224047-need-... As an example mentioning it.
It became disliked because of various problems and complaints, but mainly disappeared because Blizzard got the bright idea of preloading the patch set, especially for new expansions, in the weeks before. You can send down a ten-gig patch a month before release, then patch that patch a week before release, and a final small patch the day before release, and everything is preloaded.
The great Cambrian explosion of CDNs inspired by Netflix and friends has also greatly simplified the market.
> Like container registries?
https://github.com/uber/kraken exists, using a modified BT protocol, but unless you are distributing quite large images to a very large number of nodes, a centralized registry is probably faster, simpler and cheaper
To use BitTorrent, your machine has to listen, and otherwise be somehow reachable. In many cases it's not a given, and sometimes not even desirable. It sticks out.
I think a variant of bittorrent which may be successful in corporate and generally non-geek environments should have the following qualities:
It's so obvious that it must have been implemented, likely multiple times. It would not be well-known because the very purpose of such an implementation would be to not draw attention.
https://github.com/webtorrent/webtorrent
WebTorrent is ubiquitous by now and also supported by Brave and many torrent clients. There is still much room to build, though. Get in, the water is warm!
https://en.wikipedia.org/wiki/WebTorrent#Adoption
With some fairly minimal changes to HTTP it would be possible to get much of the benefit of BitTorrent while keeping the general principles of HTTP:
https://rwmj.wordpress.com/2013/09/09/half-baked-idea-conten...
But yes I do think Bittorrent would be ideal for OSM here.
IPFS looked like a fun middle ground, but it didn't take off. Probably didn't help that it caught the attention of some Web 3.0 people.
https://en.wikipedia.org/wiki/InterPlanetary_File_System
In my experience the official software was very buggy and unreliable. Which isn't great for something about making immutable data live forever. I had bugs with silent data truncation, GC deleting live paths and the server itself just locking up and not providing anything it had to the network.
The developers always seemed focused on making new versions of the protocols with very minor changes (no more protocol buffers, move everything to CBOR) rather than actually adding new features like encryption support or making it more suitable for hosting static sites (which seems to have been one of its main niches).
It also would have been a great tool for package repositories and other open source software archives. Large distros tend to have extensive mirror lists, but you need to configure them, find out which ones have good performance for you, and you can still only download from one mirror at a time. Decentralizing that would be very cool. Even if the average system doesn't seed any of the content, the fact that anyone can just mirror the repo and downloads automatically start pulling from them was very nice. It also makes the download resilient to any official mirror going down or changing URL. The fact that there is strong content verification built in is also great. Typically software mirrors need to use additional levels of verification (like PGP signatures) to avoid trusting the mirror.
I really like the idea, and the protocol is pretty good overall. But the implementation and evolution really didn't work well in my opinion. I tried using it for a long time, offering many of my sites over it and mirroring various data. But eventually I gave up.
And maybe controversially, it provided no capabilities for network separation and statistics tracking. This isn't critical for success, but one entry point to this market is private file-sharing sites. Having the option to use these things could give it a foot in the door and get a lot more people interested in development.
Hopefully the next similar protocol will come at some point, maybe it will catch on where IPFS didn't.
I used IPFS several years ago to get some rather large files from a friend, who had recently been interested in IPFS. From what I recall it took a full week or so to start actually transferring the files. It was so slow and finicky to connect. Bittorrent is dramatically easier to use, faster, and more reliable. It was hard to take IPFS seriously after that. I also recall an IRC bot that was supposed to post links to memes at IPFS links and they were all dead, even though it's supposed to be more resilient. I don't have the backstory on that one to know how/why the links didn't work.
What I wonder about is why we don't use the XOR principle more.
If A is a copyrighted work, and B is pure noise, then C=A^B is also pure noise.
Distribute B and C. Both of these files have nothing to do with A, because they are both pure noise.
However, B^C gives you back A.
I wouldn't expect that to hold up any more than a silly idea I had (probably not original) a while back of "Pi-Storage".
The basic idea being: can you use the digits of Pi to encode data, or rather, can you find ranges of Pi that map to data you have and use that for "compression"?
A very simple example, let's take this portion of Pi:
> 3.14159265358979323846264338327950288419716939937
Then let's say we have a piece of data that, when encoded and just numbers, results in: 15926535897997169626433832
Can we encode that as 4–15, 39–43, 21–25, 26–29 and save space? The "compression" step would take a long time (at some point you have to stop searching for overlap, as Pi goes on forever).
Anyways, a silly thought experiment that your idea reminded me of.
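A toy version of the search, just to show where it falls apart: the offset you would have to write down is, on average, at least as long as the digits it replaces, so nothing is saved.

    # Toy "Pi compression": store (offset, length) into a digit expansion of pi
    # instead of the digits themselves.
    PI_DIGITS = "14159265358979323846264338327950288419716939937510"  # digits after "3."

    def pi_encode(data_digits):
        """Return (offset, length) if the digit string occurs in our pi window, else None."""
        idx = PI_DIGITS.find(data_digits)
        return (idx, len(data_digits)) if idx != -1 else None

    # Found at offset 2 -- but for a random n-digit chunk the expected offset needs
    # on the order of n digits to write down, which is why this never compresses.
    print(pi_encode("159265358979"))  # (2, 12)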
> C=A^B is also pure noise
Is C really "pure noise" if you can get A back out of it?
It's like an encoding format or primitive encryption, where A is merely transformed into unrecognizable data, meaningful noise, which still retains the entirety of the information.
> Is C really "pure noise" if you can get A back out of it?
If you throw out B, then there's no possible way to get A out of C (short of blindly guessing what A is): that's one of the properties of a one-time pad.
But distributing both B and C is no different than distributing A in two parts, and I'd have a hard time imagining it would be treated any differently on a legal level.
No, C is really noise, fundamentally.
Imagine another copyrighted work D.
E=C^D, therefore C=D^E
As you see, the same noise can be used to recover a completely different work.
Since you can do this with any D, C is really noise and not related to any D or A.
I'm not sure I agree. In the case of new source D, C is being used as the key, not the encoded data.
> B^C gives you back A
If both B and C are pure noise, where did the information for A come from?
XOR is commutative, so for a one-time pad the key and ciphertext can be swapped.
In this example given ciphertext C = A ^ B you can decrypt plaintext A using key B or you can decrypt plaintext D using key E = C ^ D:
( A ^ B ) ^ B = A
( A ^ B ) ^ ( C ^ D ) = ( A ^ B ) ^ ( A ^ B ) ^ D = D
B and C are pure noise individually, but they are not uncorrelated, just like E and C are not uncorrelated.
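The whole argument in a few lines of Python, with short byte strings standing in for the works:

    import secrets

    def xor(x, y):
        return bytes(a ^ b for a, b in zip(x, y))

    A = b"the original work A"        # stand-in for the copyrighted work (19 bytes)
    B = secrets.token_bytes(len(A))   # genuinely random one-time pad
    C = xor(A, B)                     # the "noise" that would get distributed

    assert xor(C, B) == A             # key B turns C back into A

    D = b"a different work DD"        # any other 19-byte "work"
    E = xor(C, D)                     # manufacture a key for it
    assert xor(C, E) == D             # the same C now "decrypts" to D

So C on its own is consistent with every plaintext of that length; it only becomes A again in combination with B. Which is also why distributing both B and C is, in effect, just distributing A.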
Thanks for engaging in this discussion and explaining your thoughts. I'm still trying to understand.
> given ciphertext C = A ^ B you can decrypt plaintext A using key B
Right, C ^ B = A, to get back the original.
Already here, it seems to me that C is not pure random noise if A can be recovered from it. C contains the encoded information of A, so it's not random. Is that wrong?
---
> or you can decrypt plaintext D using key E = C ^ D
In this example, isn't E the ciphertext and C the key? E (ciphertext) = D (original) ^ C (key).
Then E ^ C = D, to get back the original.
Here, it seems to me that E contains the encoded information of D, so it's not random. And C plays the role of a key similar to B, so it's not being decoded as a ciphertext in this case, and nothing is implied about whether it's random or not.
> XOR is commutative, so for a one-time pad the key and ciphertext can be swapped.
Maybe this is what I'm missing. In the second example, I'll swap the ciphertext and key:
C ^ E = D
Hmm, so both C and E are required to recover D the original information. Does that mean that the information was somehow distributed into C and E, so that they are both meaningful data, not random?
---
But what about B in the first example, a key that was supposed to be pure random? If I swap the ciphertext and key:
B ^ C = A
The information in A is recovered from the interaction of both B and C. So the entirety of the information is not in C, the ciphertext (or key in this case).
Does that mean, B is no longer pure noise? When it was used as a key, it became meaningful in relation to A. That's information, isn't it?
That's just encryption with a one-time pad, nothing new...
What is the point of this? If you think you can mount an adequate defense based on xor in a court of law, then you are sorely mistaken. Any state attorney will say infringement with an additional step of obfuscation is still infringement, and any judge will follow that assessment.
The point is that you don't end in court in the first place because you were just downloading noise.
That's a delusion.
From a network point of view, BitTorrent is horrendous. It has no way of knowing network topology, which frequently means traffic flows from eyeball network to eyeball network, for which there is no "cheap" path available (potentially causing congestion of transit ports, affecting everyone), and there is no reliable way of forecasting where the traffic will come from, making capacity planning a nightmare.
Additionally, as anyone who has tried to share an internet connection with someone heavily torrenting, the excessive number of connections means overall quality of non-torrent traffic on networks goes down.
Not to mention, of course, that BitTorrent has a significant stigma attached to it.
The answer would have been a Squid cache box before, but HTTPS makes that very difficult, as you would have to install MITM certs on all devices.
For container images, yes, you have pull-through registries etc., but not only are these non-trivial to set up (as a service and for each client), the cloud providers charge quite a lot for storage, making it difficult to justify when not having a cache "works just fine".
The Linux distros (and CPAN and texlive etc.) have had mirror networks for years that partially address these problems, and there was an OpenCaching project running that could have helped, but it is not really sustainable for the wide variety of content that would need to be cached outside of video media or packages that only appear on caches hours after publishing.
BitTorrent might seem seductive, but it just moves the problem, it doesn't solve it.
> From a network point of view, BitTorrent is horrendous. It has no way of knowing network topology which frequently means traffic flows from eyeball network to eyeball network for which there is no "cheap" path available...
As a consumer, I pay the same for my data transfer regardless of the location of the endpoint though, and ISPs arrange peering accordingly. If this topology is common then I expect ISPs to adjust their arrangements to cater for it, just the same as any other topology.
> ISPs arrange peering accordingly
Two eyeball networks (consumer/business ISPs) are unlikely to have large PNIs with each other across wide geographical areas to cover sudden bursts of traffic between them. They will, however, have substantial capacity to content networks (not just CDNs, but AWS/Google etc) which is what they will have built out.
BitTorrent turns fairly predictable "North/South" traffic, where capacity can be planned in advance and handed off "hot potato" as quickly as possible, into what is essentially "East/West" traffic with no clear consistency, which would cause massive amounts of congestion and/or unused capacity, as they have to carry it potentially over long distances they are not used to, with no guarantee that this large flow will exist in a few weeks' time.
If BitTorrent knew network topology, it could act smarter -- CDNs accept BGP feeds from carriers and ISPs so that they can steer the traffic, this isn't practical for BitTorrent!
> If BitTorrent knew network topology, it could act smarter -- CDNs accept BGP feeds from carriers and ISPs so that they can steer the traffic, this isn't practical for BitTorrent!
AFAIK this has been suggested a number of times, but has been refused out of fears of creating “islands” that carry distinct sets of chunks. It is, of course, a non-issue if you have a large number of fast seeds around the world (and if the tracker would give you those reliably instead of just a random set of peers!), but that really isn't what BT is optimized for in practice.
Exactly. As it happens, this is an area I'm working on right now -- instead of using a star topology (direct), or a mesh (BitTorrent), or a tree (explicitly configured CDN), to use an optimistic DAG. We'll see if it gets any traction.
bittorrent will make best use of what bandwidth is available. better think of it as a dynamic cdn which can seamlessly incorporate static cdn-nodes (see webseed).
it could surely be made to care for topology but imho handing that problem to congestion control and routing mechanisms in lower levels works good enough and should not be a problem.
> bittorrent will make best use of what bandwidth is available.
At the expense of other traffic. Do this experiment: find something large-ish to download over HTTP, perhaps an ISO or similar from Debian or FreeBSD. See what the speed is like, and try looking at a few websites.
Now have a large torrent active at the same time, and see how slow the HTTP download drops to, and how much slower the web is. Perhaps try a Twitch stream or YouTube video, and see how the quality suffers greatly and/or starts rebuffering.
Your HTTP download uses a single TCP connection, most websites will just use a single connection also (perhaps a few short-duration extra connections for js libraries on different domains etc). By comparison, BitTorrent will have dozens if not hundreds of connections open and so instead of sharing that connection in half (roughly) it is monopolising 95%+ of your connection.
The other main issue I forgot to mention is that on most cloud providers, downloading from the internet is free, uploading to the internet costs a lot... So not many on public cloud are going to want to start seeding torrents!
If your torrent client is having a negative effect on other traffic then use its bandwidth limiter.
You can also lower how many connections it makes, but I don't know anyone that's had need to change that. Could you show us which client defaults to connecting to hundreds of peers?
My example was to show locally what happens -- the ISP does not have control over how many connections you make. I'm saying that if you have X TCP connections for HTTP and 100X TCP connections for BitTorrent, the HTTP connections will be drowned out. Therefore, when the link at your ISP becomes congested, HTTP will be disproportionately affected.
For the second question, read the section on choking at https://deluge-torrent.org/userguide/bandwidthtweaking/ and Deluge appears to set the maximum number of connections per torrent to 120, with a global max of 250 (though I've seen 500+ in my brief searching, mostly for Transmission and other clients).
I'll admit a lot of my BitTorrent knowledge is dated (having last used it ~15 years ago) but the point remains: ISPs are built for "North-South" traffic, that is: To/From the customer and the networks with the content, not between customers, and certainly not between customers of differing ISPs.
Torrents don't use anything like TCP congestion control and 100 connections will take a good chunk of bandwidth but much much less than 100 TCP flows.
... What? You realise BitTorrent runs over TCP/IP right?
TCP is a fallback if it can't use https://en.m.wikipedia.org/wiki/Micro_Transport_Protocol
I should have said they avoid TCP in favor of very different congestion control, sorry.
Interesting... It's been ~15 years since I last used BitTorrent personally, and I had asked a friend before replying and they swore that all their traffic was TCP -- though perhaps that may be due to CGNAT or something similar causing that fallback scenario you describe.
Thanks for the info, and sorry for jumping to a conclusion! Though my original point stands: residential ISPs are generally not built to handle BitTorrent traffic flows (customer to customer, or customer to another ISP's customer across large geographic areas), so the bursty nature would cause congestion much more easily, and BitTorrent itself isn't really made for these kinds of scenarios where content changes on a daily basis. CDNs exist for a reason, even if they're not readily available at reasonable prices for projects like OP!
The number of connections isn’t relevant. A single connection can cause the same problem with enough traffic. Your bandwidth is not allocated on a per-connection basis.
If you download 2 separate files over HTTP, you'd expect each to get roughly 1/2 of the available bandwidth at the bottleneck.
With 1 HTTP connection downloading a file and 100 BitTorrent connections downloading another, all competing at the same bottleneck, you'll find the HTTP throughput significantly reduced: congestion control algorithms are designed for rough fairness per connection, so the single HTTP flow ends up with roughly 1/101 of the capacity. That's why the first version of BBR that Google released was unpopular; it stomped on other traffic.
> Like container registries? Package repos, etc.
I had the same thoughts for some time now. It would be really nice to distribute software and containers this way. A lot of people have the same data locally and we could just share it.
I think the reason is mainly that modern pipes are big enough that there is no need to bother with a protocol as complex as BT.
I think a big part of why it's not more widely used comes down to a mix of UX friction, NAT/firewall issues, and a lack of incentives.
I agree with the sentiment but I need those files behind a corporate firewall. :(
AFAIK BitTorrent doesn't allow updating the files in a torrent.
This is a feature, not a bug. A torrent file/magnet link contains a hash of the data, which is immutable. Just publish a new link (you should do that anyway, even with HTTP).
Agreed, it's cool that you can reverify a torrent from 3 years ago and make sure the data on your disk isn't lost or damaged.
That's useful if you did not prepare for that eventuality. But if you do want to store data long-term it's better to generate error-correcting codes like PAR2 so that you can actually recover partial errors without having to rely on seeders being available - and when you do that you no longer need the torrent for verification.
Imagine if, for websites, you had to get a brand new domain every time you wanted to change the contents of your web page. You couldn't just go to google.com because it would be immutable; you'd have to somehow know to go to google10393928.com, and any links to Google on the internet would point to some old version. The ability to have a link always refer to the latest version is useful. The same applies to torrents. It's possible for the magnet link to the latest version to get lost, and then a bunch of people are accidentally downloading old, worse versions of the file with no way to find the newest one.
What a dumb strawman. Domains are not hashes.
But imagine if domains didn't exist then. At least try to see their point instead of calling it a strawman.
Immutability of specific releases is great, but you also want a way to find new related releases/versions.
I don't need to imagine domains do not exist. They exist. OP brought in domains as an example.
For some you want a name where the underlying resource can change, for others you want a hash of the actual resource. Which one you want depends on the application.
> I don't need to imagine domains do not exist. They exist. OP brought in domains as an example.
In the context of webpages, a domain lets you deploy new versions.
With a torrent file, a domain does not let you do that.
Please try to understand the comparison they're making instead of just saying "domains are not hashes" "domains do exist".
> For some you want a name where the underlying resource can change, for others you want a hash of the actual resource. Which one you want depends on the application.
Right.
And torrents don't give you the choice.
Not having the choice is much closer to "bug" than "feature".
Needing a new magnet link is fine. The old magnet link working indefinitely is great. Having no way to get from old magnet to new magnet is not as fine.
The OP brought in domains as an example but domains are not applicable.
Everybody on HN should know how a domain works. I think most people on HN understand what a hash is and how a magnet link works. The fact that you can't easily replace the resource under a magnet link is a feature, not a bug. If you think for a bit about the consequences of being able to easily replace the resources associated with a magnet link, rather than just the 'convenience of being able to update a torrent', you'll see that this is not a simple thing at all.
Torrents are simply a different thing than 'the web' and to try to equate the one to the other is about as silly as trying to say that you can't use a screwdriver to put nails in the wall. They're different things. Analogies are supposed to be useful, not a demonstration of your complete lack of understanding of the underlying material.
I distribute some software that I wrote using a torrent with a magnet link, so I'm well aware of the limitations there, but these limitations are exactly why I picked using a torrent in the first place.
> The fact that you can't easily replace the resource under a magnet link is a feature not a bug.
I didn't even go that far. I just said link to a new one.
You're the one that said replacing can be good! What is this.
It's a straw man because no one is telling you to replace domains with torrents. You'd replace the download link to https://yourwebsite.com/files/download-1.0.zip with https://yourwebsite.com/files/download-1.0.zip.torrent or the magnet URL corresponding to that file. Even if you wanted to replace HTTP with torrents entirely then the domain would be updated to point to the current torrent - after all the whole point of domains is that you have a memorable and persistent name that can be resolved to the current location of a service.
That works fine if it's a download link on a website. You offload the updating to the website.
There are many other ways torrents get used, where people aren't looking at the website or there is no website.
It is technically possible, and there is a proposal to standardize it, but it has been in draft state for nearly 10 years https://www.bittorrent.org/beps/bep_0046.html
A lot of people will be using this data at work where BitTorrent is a non-starter.
Or IPFS/IPNS
I have a more direct answer for you: moderation.
It's not all about how you distribute content. We must also decide which content to distribute, and that is a hard problem.
The most successful strategy so far has been moderation. Moderation requires hierarchical authority: a moderator who arbitrarily determines which data is and is not allowed to flow. Even bittorrent traffic is moderated in most cases.
For data to flow over bittorrent, two things must happen:
1. There must be one or more seeders ready to connect when the leecher starts their download.
2. There must be a way for a prospective leecher to find the torrent.
The best way to meet both of these needs is with a popular tracker. So here are the pertinent questions:
1. Is your content fit for a popular tracker? Will it get buried behind all the Disney movies and porn? Does it even belong to an explicit category?
If not, then you are probably going to end up running your own tracker. Does that just mean hosting a CDN with extra steps? Cloud storage is quite cheap, and the corporate consolidation of the internet by Cloudflare, Amazon, etc. has resulted in a network infrastructure that is optimized for that kind of traffic, not for bittorrent.
2. Is a popular tracker a good fit for your content? Will your prospective downloaders even think to look there? Will they be offended by the other content on that tracker, and leave?
Again, a no will lead to you making your own tracker. Even in the simplest case, will users even bother to click your magnet link, or will they just use the regular CDN download that they are used to?
So what about package repos? Personally, I think this would be a great fit, particularly for Nix, but it's important to be explicit about participation. Seeding is a bad default for many reasons, which means you still need a relatively reliable CDN/seed anyway.
---
The internet has grown into an incredibly hierarchical network, with incredibly powerful and authoritative participants. I would love to see a revolution in decentralized computing. All of the technical needs are met, but the sociopolitical side needs serious attention. Every attempt at decentralized content distribution I have seen has met the same fate: drowned in offensive and shallow content by those who are most immediately excited to be liberated from authority. Even if it technically works, it just smells too foul to use.
I propose a new strategy to replace moderation: curation. Instead of relying on authority to block out undesirable content, we should use attested curation to filter in desirable content.
Want to give people the option to browse an internet without porn? Clearly and publicly attest which content is porn. Don't light the shit on fire, just open the windows and let it air out.
People like Geofabrik are why we can (sometimes) have nice things, and I'm very thankful for them.
The level of irresponsibility/cluelessness you see from developers if you're hosting any kind of API is astonishing, so these downloads are not surprising at all... If someone had told me, a couple of years back, the things I've since seen, I'd have dismissed them as making stuff up and grossly exaggerating...
However, by the same token, it's sometimes really surprising how rarely API developers think in terms of multiples of things - it's very often just endpoints that act on single entities, even when the nature of the use case is almost never at that level - so you have no choice but to send 700 requests to do "one action".
> Level of irresponsibility/cluelessness you can see from developers if you're hosting any kind of an API is astonishing
This applies to anyone unskilled in a profession. I can assure you, we're not all out here hammering the shit out of any API we find.
With the accessibility of programming to just about anybody, and particularly now with "vibe-coding", it's going to happen.
Slap a 429 (Too Many Requests) in your response or something similar using a leaky-bucket algo and the junior dev/apprentice/vibe coder will soon learn what they're doing wrong.
- A senior backend dev
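For illustration, a bare-bones leaky-bucket sketch in Python (framework-free; the capacity, drain rate and the handle_request shape are all made up, so treat it as the idea rather than a drop-in implementation):

    import time

    class LeakyBucket:
        """Per-client leaky bucket: each request adds to the bucket, the bucket
        drains at a fixed rate, and anything that would overflow gets a 429."""

        def __init__(self, capacity=60, drain_per_sec=1.0):
            self.capacity = capacity            # how much burst we tolerate
            self.drain_per_sec = drain_per_sec  # sustained allowed rate
            self.level = 0.0
            self.last = time.monotonic()

        def allow(self):
            now = time.monotonic()
            # Leak out whatever drained since the last request.
            self.level = max(0.0, self.level - (now - self.last) * self.drain_per_sec)
            self.last = now
            if self.level + 1 > self.capacity:
                return False                    # bucket full -> reject
            self.level += 1
            return True

    buckets = {}  # client id (IP, API key, ...) -> bucket

    def handle_request(client_id):
        bucket = buckets.setdefault(client_id, LeakyBucket())
        if not bucket.allow():
            return 429, {"Retry-After": "60"}, b"Too Many Requests"
        return 200, {}, b"...file bytes..."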
> Slap a 429 [...] will soon learn what they're doing wrong.
Oh how I wish this were true. We have customers sending 10s-100s of requests per second and they will complain if even just one gets a 429. As in, they escalate to their enterprise account rep. I always tell them to buy the customer an "error handling for dummies" book but they never learn.
Another commenter said essentially the same thing, I sympathise, it's painful when the "customer" can't understand something clearly telling them they're doing it wrong.
I don't have an answer, but I wonder, for the sake of protecting your own org, whether some sort of "abuse" policy is the only approach; as in, deal with the human problem: be clear with them in writing somewhere that if they're seeing response X or Y (429 etc.) they're essentially abusing the API and need to behave.
The only thing that reliably works is to push the cost to the customer - so they can do whatever insanity they want, and they get charged accordingly.
And we’ve had cases where we had “fuckoff” charges in contracts (think $1 per API call after X thousand calls a day) and the customer just gladly pays tens of thousands of dollars and thanks us for fixing their problem.
The money is nice but sometimes you just want to shake them and say “look we have notifications you don’t need to poll the endpoint ten times a second, fuck it give us the code and we’ll fix it …”
I bet if the costs were an order of magnitude larger, they'd think the costs were as unreasonable as we think their huge number of requests are.
There's just no winning sometimes sigh.
It's hard to switch them over, but if you have it in the beginning, you boil the frog.
Heh, exactly my other reply - I feel for you, friend!
Yes
Well, if this is a supported (as in $) account, sure enough, have the API rate limits published and tell them in the most polite way to RTFM
Thanks for the reply - I did not mean to rant, but, unfortunately, this is in context of a B2B service, and the other side are most commonly IT teams of customers.
There are, of course, both very capable and professional people, and also kind people who are keen to react / learn, but we've also had situations where 429s result in complaints to their management how our API "doesn't work", "is unreliable" and then demanding refunds / threatening legal action etc...
One example was sending 1.3M update requests a day to manage state of ~60 entities, that have a total of 3 possible relevant state transitions - a humble expectation would be several requests/day to update batches of entities.
> I did not mean to rant
Not at all, I sympathise, we're all like minded people here!
> One example was sending 1.3M update requests a day to manage state of ~60 entities, that have a total of 3 possible relevant state transitions
> but we've also had situations where 429s result in complaints to their management how our API "doesn't work", "is unreliable" and then demanding refunds / threatening legal action etc
That's painful, and at this point we're beyond technical solutions; this needs human solutions. If they can't realise that they're rate limited because they're basically abusing the API, they need to be told in no uncertain terms.
Of course I understand that it's not that simple, as a backend dev, my "customers" are usually other devs so I can be frank, but when dealing with B2B customers we often have to act like they're not in the wrong.
But that is a question that should be escalated to management right? If they charge the customer enough that allowing them to make 1.3M requests to update 60 entities makes sense, why not let them?
If they want the service stupidly overprovisioned to deal with these nutjobs, then that’s what we’ll do. I find that they’re generally receptive to the message of $10k per month to serve nutjobs, $100 per month to serve everyone else, though.
That’s the key - have that in place from the beginning.
Because many “enterprise” customers can spend literally millions doing shit the absolute wrong way, but have $0 budget for a developer to make it work right.
I don't understand why features like S3's "Requester Pays" aren't more widely used (and available outside AWS). Let the inefficient consumer bear their own cost.
Major downside is that this would exclude people without access to payment networks, but maybe you could still have a rate-limited free option.
They mention a single user downloading a 20GB file thousands of times on a single day, why not just rate limit the endpoint?
Their download service does not require authentication, and they are kind enough to be hesitant about blocking IPs (one IP could be half of a university campus, for example). So that leaves chasing around to find an individual culprit and hoping they'll be nice and fix it.
They could however rate-limit per IP with possibly a whitelist or higher limits for cooperative universities that are willing to prevent abuse from their own network. Unless they are facing a DDOS but then this appeal is even less likely to help.
Honestly, both sides could use a little more empathy: clients need to respect shared infrastructure, and API devs need to think more like their users
Sounds like some people are downloading it in their CI pipelines. Probably unknowingly. This is why most services stopped allowing automated downloads for unauthenticated users.
Make people sign up if they want a url they can `curl` and then either block or charge users who download too much.
I'd consider CI one of the worst massive wastes of computing resources invented, although I don't see how map data would be subject to the same sort of abusive downloading as libraries or other code.
This stuff tends to happen by accident. Some org has an app that automatically downloads the dataset if it's missing, helpful for local development. Then it gets loaded in to CI, and no one notices that it's downloading that dataset every single CI run.
At some point wilful incompetence becomes malice. You really shouldn't allow network requests from your CI runners unless you have something that cannot be solved in another way (hint: you don't).
Let's say you're working on an app that incorporates some Italian place names or roads or something. It's easy to imagine how when you build the app, you want to download the Italian region data from geofabrik then process it to extract what you want into your app. You script it, you put the script in your CI...and here we are:
> Just the other day, one user has managed to download almost 10,000 copies of the italy-latest.osm.pbf file in 24 hours!
You shouldn't download that data on demand at build time. Dependencies on state you don't control are bad even without the bandwidth issues.
Whenever people complain about the energy usage of LLM training runs I wonder how this stacks up against the energy we waste by pointlessly redownloading/recompiling things (even large things) all the time in CI runs.
Optimising CI pipelines has been a strong aspect of my career so far.
Anybody can build a pipeline to get a task done (thousands of quick & shallow howto blog posts) but doing this efficiently so it becomes a flywheel rather than a blocker for teams is the hard part.
Not just caching but optimising job execution order and downstream dependencies too.
The faster it fails, the faster the developer feedback, and the faster a fix can be introduced.
I quite enjoy the work and always learning new techniques to squeeze extra performance or save time.
Also for some reason, most CI runners seem to cache nothing except for that minor thing that you really don't want cached.
This is exactly it - you can cache all the wrong things easily, cache only the code you wanted changed, or cache nothing but one small critical file nobody knows about.
No wonder many just turn caching entirely off at some point and never turn it back on.
CI is great for software reliability but it should not be allowed to make network requests.
CI itself doesn't have to be a waste. The problem is most people DGAF about caching.
You don't need caching if your build can run entirely offline in the first place.
I suspect web apps that "query" the GPKG files. Parquet can be queried surgically, I'm not sure if there is a way to do the same with GPKG.
Can we identify requests from CI servers reliably?
You can identify requests from Github's free CI reliably which probably covers 99% of requests.
For example GMP blocked GitHub:
https://www.theregister.com/2023/06/28/microsofts_github_gmp...
This "emergency measure" is still in place, but there are mirrors available so it doesn't actually matter too much.
I try to stick to GitHub for GitHub CI downloads.
E.g. my SQLite project downloads code from the GitHub mirror rather than Fossil.
Sure, have a js script involved in generating a temporary download url.
That way someone manually downloading the file is not impacted, but if you try to put the url in a script it won’t work.
There is really no reason to add a JS dependency for this - whatever server-side component expires old URLs can just as well update the download page with the new one.
Having some kind of lightweight auth (API key, even just email-based) is a good compromise
Some years ago I thought, no one would be stupid enough to download 100+ megabytes in their build script (which runs on CI whenever you push a commit).
Then I learned about Docker.
It's like, once it's in a container, people assume it's magic and free
Wait, docker caches layers, you don't have to rebuild everything from scratch all the time... right?
It does if you are building on the same host with preserved state and didn't clean it.
There are lots of cases where people end up with an empty docker repo at every CI run, or regularly empty the repo because docker doesn't have any sort of intelligent space management (like LRU).
To get fine-grained caching you need to use cache-mounts, not just cache layers. But the cache export doesn't include cache mounts, therefore the docker github action doesn't export cache mounts to the CI cache.
https://github.com/moby/buildkit/issues/1512
That's why I build project specific images in CI to be used in CI. Running apt-get every single time takes too damn long.
Alternatively you can use a local cache like AptCacherNg
"There have been individual clients downloading the exact same 20-GB file 100s of times per day, for several days in a row. (Just the other day, one user has managed to download almost 10,000 copies of the italy-latest.osm.pbf file in 24 hours!) Others download every single file we have on the server, every day."
This sounds like a problem rate-limiting would easily solve. What am I missing? The page claims almost 10,000 copies of the same file were downloaded by the same user.
The server operator is able to count the number of downloads in a 24h period for an individual user but cannot or will not set a rate limit.
Why not?
Will the users mentioned above (a) read the operator's message on this web page and then (b) change their behaviour?
I would bet against (a), and therefore against (b) as well.
Geofabrik guy here. You are right - rate limiting is the way to go. It is not trivial though. We use an array of Squid proxies to serve stuff and Squid's built-in rate limiting only does IPv4. While most over-use comes from IPv4 clients it somehow feels stupid to do rate limiting on IPv4 and leave IPv6 wide open. What's more, such rate-limiting would always just be per-server which, again, somehow feels wrong when what one would want to have is limiting the sum of traffic for one client across all proxies... then again, maybe we'll go for the stupid IPv4-per-server-limit only since we're not up against some clever form of attack here but just against carelessness.
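To sketch the "treat IPv4 and IPv6 uniformly" part (this is plain Python, not Squid configuration, and the quota number is made up; sharing the counters across several proxies would still need a common store, which is the genuinely annoying bit):

    import ipaddress
    import time
    from collections import defaultdict

    DAILY_LIMIT_BYTES = 50 * 1024**3  # made-up quota: 50 GB per client per day

    def client_bucket(addr: str) -> str:
        """One bucket per IPv4 address, one per IPv6 /64 (a typical end-user
        prefix), so both address families get limited the same way."""
        ip = ipaddress.ip_address(addr)
        if ip.version == 4:
            return str(ip)
        return str(ipaddress.ip_network(f"{ip}/64", strict=False))

    usage = defaultdict(lambda: [0, time.time()])  # bucket -> [bytes served, window start]

    def over_quota(addr: str, response_bytes: int) -> bool:
        bucket = client_bucket(addr)
        used, started = usage[bucket]
        if time.time() - started > 86400:          # roll the 24h window
            used, started = 0, time.time()
        used += response_bytes
        usage[bucket] = [used, started]
        return used > DAILY_LIMIT_BYTES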
Stick tables work with either IPv4 or IPv6
Oh hey, it's me, the dude downloading italy-latest every 8 seconds!
Maybe not, but I can't help but wonder if anybody on my team (I work for an Italian startup that leverages GeoFabrik quite a bit) might have been a bit too trigger happy with some containerisation experiments. I think we got banned from geofabrik a while ago, and to this day I have no clue what caused the ban; I'd love to be able to understand what it was in order to avoid it in the future.
I've tried calling and e-mailing the contacts listed on geofabrik.de, to no avail. If anybody knows of another way to talk to them and get the ban sorted out, plus ideally discover what it was from us that triggered it, please let me know.
Hey there dude downloading italy-latest every 8 seconds, nice to hear from you. I don't think I saw an email from you at info@geofabrik, could you re-try?
Absolutely, a couple of them got bounced but one should've gone through. I'll retry as soon as I get into the office.
Edit: I sent you the email, it got bounced again with a 550 Administrative Prohibition. Will try my university's account as well.
Edit2: this one seems to have gone through, please let me know if you can't see it.
Do they email heavy users? We used the free Nominatim API for geocoding addresses in 2012, and our email was a required parameter. They mailed us and asked us to cache results to reduce request rates.
There's no login, so they won't have any email addresses.
IIRC users voluntarily provided emails per request.
I have a funny feeling that the sort of people who do these things don't read these sorts of blog posts.
Definitely a use case for bittorrent.
If the data changes, how would a torrent client pick it up and download changes?
Let the client curl latest.torrent from some central service and then download the big file through bittorrent.
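Something like this sketch, say (the URL is a placeholder and the hand-off to the torrent client is deliberately left as a comment, since every client has its own CLI/API):

    import hashlib
    import urllib.request

    LATEST_TORRENT_URL = "https://download.example.org/italy-latest.osm.pbf.torrent"  # hypothetical

    def fetch_latest_torrent(previous_digest=None):
        """Grab the tiny .torrent over HTTP and only act on it if it changed."""
        data = urllib.request.urlopen(LATEST_TORRENT_URL).read()
        digest = hashlib.sha256(data).hexdigest()
        if digest == previous_digest:
            return digest                      # nothing new; keep seeding the old one
        with open("latest.torrent", "wb") as f:
            f.write(data)
        # ...hand latest.torrent to your torrent client here (watch-dir/RSS/API)...
        return digest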
A lot of torrent clients support various APIs for automatically collecting torrent files. The most common is to simply use RSS.
There's a BEP for updatable torrents.
Pretty sure people used or even still use RSS for this.
>Just the other day, one user has managed to download almost 10,000 copies of the italy-latest.osm.pbf file in 24 hours!
Whenever I have done something like that, it's usually because I'm writing a script that goes something like:
1. Download file 2. Unzip file 3. Process file
I'm working on step 3, but I keep running the whole script because I haven't yet built a way to just do step 3.
I've never done anything quite that egregious though. And these days I tend to be better at avoiding this situation, though I still commit smaller versions of this crime.
My solution to this is to only download if the file doesn't exist. An additional bonus is that the script now runs much faster because it doesn't need to do any expensive networking/downloads.
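Something like this, roughly (URL and filename are placeholders):

    import os
    import urllib.request

    URL = "https://download.example.org/italy-latest.osm.pbf"  # hypothetical
    LOCAL = "italy-latest.osm.pbf"

    def fetch_once():
        """Only hit the network if the file isn't already on disk."""
        if os.path.exists(LOCAL):
            return LOCAL
        tmp = LOCAL + ".part"
        urllib.request.urlretrieve(URL, tmp)  # download to a temp name...
        os.rename(tmp, LOCAL)                 # ...so a failed run doesn't leave a "complete" file
        return LOCAL

    path = fetch_once()
    # ...unzip/process path here; re-running the script now skips the download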
When I do scripts like that I modify it to skip the download step and keep the old file around so I can test the rest without anything time-consuming.
10,000 times in a day averages out to one download roughly every 8.6 seconds. No way someone has a fix to test every 8 seconds all day; this is more like someone who wanted to download a new copy every day, or every hour, but messed up a milliseconds config or something. Or it's simply a malicious user.
An easy way to avoid this is to split the work into several scripts (bash, for example): getfile.sh, processdata.sh, postresults.sh, and doall.sh, where doall.sh simply runs ./getfile.sh, ./processdata.sh and ./postresults.sh in sequence.
Meta: the title could be clearer, e.g. "Download OSM Data Responsibly" would have helped me figure out the context faster as someone not familiar with the domain name shown.
I think the original title is fine, in this case if it's OSM or not doesn't really matter as it applies to many other instances of people downloading data in a loop for no good reason.
Then there will be an another comment: What is OSM?
Current title is fine.
Seems like a perfect justification for using API keys. Unless I'm missing the nuance of this software model.
But that raises the complexity of hosting this data immensely. From a file + nginx you now need active authentication, issuing keys, monitoring, rate limiting...
Yes, this is the "right" solution, but it is a huge pain, and it would be nice if we could have nice things without needing to do all of this work.
This is tragedy of the commons in action.
Speaking as the person running it - introducing API keys would not be a big deal, we do this for a couple paid services already. But speaking as a person frequently wanting to download free stuff from somewhere, I absolutely hate having to "set up an account" just to download something once. I started that server well over a decade ago (long before I started the business that now houses it); the goal has always been first and foremost to make access to OSM data as straightforward as possible. I fear that having to register would deter many a legitimate user.
Yeah, I totally get it. In an ideal world we could just stick a file on an HTTP server and people would download it reasonably. Everything is simpler and happier this way.
There’s a cheapish middle ground - generate unique URLs for each downloaded, which basically embeds a UUID “API” key.
You can paste it into a curl script, but now the endpoint can track it.
So not example.com/file.tgz but example.com/FCKGW-RHQQ2-YXRKT-8TG6W-2B7Q8/file.tgz
Yeah, but everyone knows that one. ;)
Everyone also knows the API keys that are used for requests from clients (apps/websites/etc.). ;)
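For the curious, a minimal sketch of the unique-URL idea (the domain and lifetime are made up, and a real deployment would persist tokens in a database rather than an in-memory dict):

    import time
    import uuid

    issued = {}  # token -> (who requested it, issued-at timestamp)

    def make_download_url(filename, requester):
        """Hand each visitor to the download page their own URL."""
        token = uuid.uuid4().hex
        issued[token] = (requester, time.time())
        return f"https://example.com/{token}/{filename}"

    def serve(token, filename):
        info = issued.get(token)
        if info is None or time.time() - info[1] > 24 * 3600:
            return 403                          # unknown or expired token
        print(f"{info[0]} fetched {filename}")  # every fetch is now attributable
        return 200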
Chances of those few people doing very large amounts of downloads reading this are quite small. Basic rate limits on IP level or some other simple fingerprinting would do a lot of good for those edge cases, as those folks are most likely not aware of this happening.
Yes, IP rate limiting is not perfect, but if they have a way of identifying which user is downloading the whole planet every single day, that same user can be throttled as well.
They should rate limit it with a limit high enough that normal use doesn't hit it
Appeals to responsibility aren't going to sink in with people who are clearly careless.
But rate-limiting public data is a huge pain. You can't really just serve a static file anymore. Maybe you can configure your HTTP server to do IP-based rate limiting, but that is always either ineffective (for example, public clouds where the downloader gets a new IP every time) or hits bystanders (a reasonable downloader whose traffic egresses from the same IP or net block).
So if you really want to do this you need to add API keys and authentication (even if it is free) to reliably track users.
Even then you will have some users that find it easier to randomly pick from 100 API keys rather than properly cache the data.
Rate limit anonymous, unlimited if you provide an API key.
This way you can identify WHO is doing bad things and disable said API key and/or notify them.
I continue to be baffled that the geofabrik folks remain the primary way to get a clean-ish OSM shapefile. Big XKCD "that one bloke holding up the internet" energy.
Also, everyone go contribute/donate to OSM.
> primary way to get a clean-ish OSM shapefile
Shapefiles shouldn't be what you're after, Parquet can almost always do a better job unless you need to either edit something or use really advanced geometry not yet supported in Parquet.
Also, this is your best source for bulk OSM data: https://tech.marksblogg.com/overture-dec-2024-update.html
If you're using ArcGIS Pro, use this plugin: https://tech.marksblogg.com/overture-maps-esri-arcgis-pro.ht...
It's beneficial to the wider community, and also supports their commercial interests (OSM consulting). Win-win.
>one user has managed to download almost 10,000 copies of the italy-latest.osm.pbf file in 24 hours!
Trash code most likely. Rate limit their IP. No one cares about people that do this kind of thing. If they're a VPN provider...then still rate limit them.
Ah, responsibility... The one thing we hate teaching and hate learning even more. Someone is probably downloading files in some automated pipeline. Nobody taught them that with great power (being able to write programs and run them on the internet) comes great responsibility. It's similar to how people drive while intoxicated or on the phone etc. It's all fun until you realise you have a responsibility.
I mean, at this point I wouldn't mind if they rate-limited downloads. A _single_ customer downloading the same file 10,000 times? Sorry, we need to provide for everyone, try again at some other point.
It is free, yes, but there's no need to abuse it, or for them to give away as much resource as they can for free.
This. Maybe they could actually make some infra money out of this. Make downloads token-based with a free tier, and pay if you blow past it.
Can't the server detect and prevent repeated downloads from the same IP, forcing users to act accordingly?
See: "Also, when we block an IP range for abuse, innocent third parties can be affected."
Although they refer to IP ranges, the same principle applies on a smaller scale to a single IP address: (1) dynamic IP addresses get reallocated, and (2) entire buildings (universities, libraries, hotels, etc.) might share a single IP address.
Aside from accidentally affecting innocent users, you also open up the possibility of a DOS attack: the attacker just has to abuse the service from an IP address that he wants to deny access to.
More sophisticated client identification can be used to avoid that edge case, e.g. TLS fingerprints. They can be spoofed as well, but if the client is going through that much trouble, then they should be treated as hostile. In reality it's more likely that someone is doing this without realizing the impact they're having.
It could be slightly more sophisticated than that. Instead of outright blocking an entire IP range, set quotas for individual clients and throttle downloads exponentially. Add latency, cap the bandwidth, etc. Whoever is downloading 10,000 copies of the same file in 24 hours will notice when their 10th attempt slows down to a crawl.
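A rough sketch of that kind of exponential throttling (all numbers invented; the pacing is deliberately crude):

    import time
    from collections import defaultdict

    FREE_DOWNLOADS_PER_DAY = 5
    FULL_SPEED_BYTES_PER_SEC = 10_000_000

    counts = defaultdict(lambda: [0, time.time()])  # client -> [downloads, window start]

    def allowed_rate(client):
        n, started = counts[client]
        if time.time() - started > 86400:           # roll the 24h window
            counts[client] = [0, time.time()]
            n = 0
        counts[client][0] = n + 1
        # Halve the allowed bandwidth for every download beyond the free quota,
        # capped so the rate bottoms out around 10 B/s instead of hitting zero.
        excess = min(max(0, n + 1 - FREE_DOWNLOADS_PER_DAY), 20)
        return FULL_SPEED_BYTES_PER_SEC / (2 ** excess)

    def send_file(path, client, chunk=64 * 1024):
        rate = allowed_rate(client)
        with open(path, "rb") as f:
            while data := f.read(chunk):
                yield data                          # hand the chunk to the web server
                time.sleep(len(data) / rate)        # crude pacing to enforce the cap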
It'll still suck for CI users. What you'll find is that occasionally someone else on the same CI server will have recently downloaded the file several times and when your job runs, your download will go slowly and you'll hit the CI server timeout.
CI users should cache the asset to both speed up CI runs and not run up the costs to orgs providing the assets free of charge
that's working as intended then, you should be caching such things. It sucking for companies that don't bother is exactly the point, no?
It's not unreasonable for each customer to maintain a separate cache (for security reasons), so that each of them will download the file once.
Then it only takes one bad user on the same subnet to ruin the experience for everyone else. That sucks, and isn't working as intended, because the intent was to only punish the one abusive user.
That's why subnets have abuse contacts. If the network operators don't care they can't complain about being throttled/blocked wholesale.
Why put a load of money into infra and none into simple mitigations like rate limiting to prevent the kind of issues they are complaining they need the new infra for?
Seems to me like the better action would be to implement rate-limiting, rather than complain when people use your resource in ways you don't expect. This is a solved problem.
Just wait until some AI dudes decide it is time to train on maps...
I’m looking forward to visiting all of the fictional places it comes up with!
AI models are trained relatively rarely, so it's unlikely this would be very noticeable among all the regular traffic. Just the occasional download-everything every few months.
One would think so. If AI bros were sensible, responsible and intelligent.
However, the practical evidence is to the contrary: AI companies are hammering every webserver out there, ignoring any kind of convention like robots.txt, and re-downloading everything at pointlessly short intervals. Annoying everyone and killing services.
Just a few recent examples from HN: https://news.ycombinator.com/item?id=45260793 https://news.ycombinator.com/item?id=45226206 https://news.ycombinator.com/item?id=45150919 https://news.ycombinator.com/item?id=42549624 https://news.ycombinator.com/item?id=43476337 https://news.ycombinator.com/item?id=35701565
IMHO in the long term this will lead to a closed web where you are required to log-in to view any content.
Map slop? That's new!
This is a good reminder that just because a server can handle heavy traffic doesn't mean it should be treated like a personal data firehose
I wonder if we're going to see more irresponsible software from people vibe coding shit together and running it without even knowing what it's doing.
Sounds like someone has an internal deployment script with the origin mirror in the update URI. This kind of hammering is also common for folks prototyping container build recipes, so don't assume the person is necessarily intending anyone harm.
Unfortunately, the long-lived sessions needed for file servers make them an easy target for DoS, especially if they support quality-of-life features like media fast-forward/seek. Thus, a large file server normally doesn't share a host with a nimble website or API server (CDNs still exist for a reason.)
Free download API keys with a set data/time quota and an IP rate limit are almost always necessary (e.g. the connection slows down to 1 kB/s after a set daily limit). Ask anyone that runs a tile server or media platform about the connection-count costs of free resources.
It is a trade-off, but better than the email-me-a-link solution some firms deploy. Have a wonderful day =3
Good to know. Thanks!
> If you want a large region (like Europe or North America) updated daily, use the excellent pyosmium-up-to-date program
This is why naming matters. I never would have guessed in a million years that software named "pyosmium-up-to-date" does this function. Give it a better name and more people will use it.