For a more practical version (containing only infohashes that are observed on the dht) there is bitmagnet [1]. No public instances though, you have to self-host
Filtered how? By some keywords I don't want to know? What about encrypted zips of CSAM? There's no way to filter that in reality.
If you want to learn more about why, and you either speak German or can handle YouTube's auto-translate, I recommend this documentary on the matter [0]. The pedo criminals are using scene methods to share their illegal content.
Yes, a simple keyword list in the classifier, matched on the torrent name and file names. Easy enough to find in the source if you look for it. That filter won't help against people uploading CSAM as documents.7z. But any filter that would want to do something against that would require downloading the content, which would be even more illegal (in addition to being wildly impractical)
bitmagnet only has the info you get by looking up the infohash in the dht, which is basically the same info that's stored in a .torrent file: a name, a list of files with offsets and paths, and a bunch of block hashes. That's not a lot to go on, and e.g. doesn't tell you if the zip is encrypted
I guess you could filter all torrents that include just zips/rars/7zips. That would exclude a lot of harmless content. Probably too much harmless content to make it a default, but if you only care about hollywood releases it would be a useful filter
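A minimal sketch of that "archives only" filter, assuming the indexer hands you a torrent's file list as plain path strings. The function name and the suffix set are hypothetical and are not taken from bitmagnet's source. Multi-part archives (`.r00`, `.001` and so on) are deliberately left out.

```python
from pathlib import PurePosixPath

# Assumption: these three suffixes are what "zips/rars/7zips" means here.
ARCHIVE_SUFFIXES = {".zip", ".rar", ".7z"}


def is_archive_only(file_paths: list[str]) -> bool:
    """Return True when every file listed in the torrent is a zip/rar/7z archive."""
    if not file_paths:
        # No file list means nothing to judge; do not flag it.
        return False
    return all(
        PurePosixPath(path).suffix.lower() in ARCHIVE_SUFFIXES
        for path in file_paths
    )


print(is_archive_only(["documents.7z"]))
print(is_archive_only(["Movie/movie.mkv", "Movie/subs.zip"]))
```

It only looks at names, so it cannot tell an encrypted archive from a harmless one, which is exactly the over-blocking trade-off described above.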
If there was a public list of hashes of (8/18KiB blocks of) CSAM content that would be useful for a filter, but I don't think such a thing exists
Does running an indexer and crawler help make the content available to others, or why would this be legally risky? Why would anyone care about what kind of Docker container I run on my home server?
Does anybody know what they are using in the browser to perform DHT?
In theory this could be used to share torrent links by a different reference (ideally you could also add an anchor too). Somebody else could have a page that takes keywords and points you to pages hosted on the site.
The page is making a WebSocket connection to the server and getting the peer info through the WebSocket connection. I think the magic happens on the server.
DHT crawlers/indexers already exist to perform that function; they crawl and store infohashes (+ metadata when they receive it) and allow users to search that metadata to return relevant infohashes
By announcing itself, the indexer makes itself more likely to be handed out as a peer to anyone else interested in that infohash. Every connection attempt it subsequently receives is evidence of another peer announcing or joining that torrent. In effect, it "baits" peers into revealing themselves
> That's not detecting "announcers", but maybe more like detecting "indexers".
I think you’re correct: the secondary freebooting indexers add their own tracker(s) to the original prefilled tracker list after the private torrent was created, insert them into the reuploaded (usually public) torrent, and sometimes even remove the original private trackers so as not to phone home and tell on themselves.
I’m happy to be corrected, but private trackers typically bind the downloading IP of the torrent to the announcing tracker to validate legitimate clients. Private trackers don’t consider any extra trackers (announcers in this context) as valid or authorized. I have heard that modded BitTorrent clients can intentionally misreport upload stats to fudge the numbers and game your quota, as many private trackers/torrent sites enforce a minimum ratio of 1.0 or higher.
I’ve heard of ways that folks with legitimate access to the private torrent tracker and torrents clone the IPs of other clients and then use a secondary torrent client to request blocks, bypassing the tracker entirely and not reporting any downloads (or uploads, for that matter), so the quota of the first legit client is not affected positively or negatively.
Assuming the web server does not actually store and serve pages in a conventional sense, but rather acts like an application that renders the results of parsing and processing the user's input, I wonder what the legal implications are.
I wonder how hosting a torrent is different to google showing a link to a pirated movie, both are just holding data that tells you where to find the content, not the content itself
I think Google is expected to abide by DMCA takedowns in such cases, but IANAL. My understanding is that even an indirect reference (such as a link or infohash) is a DMCA violation.
The infohash isn't copyrighted, so it's not illegal information in and of itself. Serving the infohash isn't serving the torrent, and serving the torrent is also not serving copyrighted material. I believe that downloading is still illegal absent a fair use exemption, but it's rarely prosecuted because you have to prove the absence of the exemption. It's uploading copyrighted content that's actually illegal and also easy to prosecute, so it's seeders that usually get bopped.
The site doesn't publish any, except the two legal torrents that are on the front page. Any others you have to either request specifically, or are simply randomly generated.
I don't understand why so many people seem so fascinated by constructions like the library of Babel. Yes it contains the answers to all your questions, but there are some significant drawbacks.
* It has more wrong information than right information, with no way to tell the difference.
* If you had an oracle that could tell you how to get to the book you need, the navigation instructions to get to the book will be at least as long as the book, on average.
The Library of Babel made me aware that choosing/finding is not super distinct from making/creating. Or discovery and invention. In math, there is a distinction between "there exists" and "we can construct", but "we can construct" is similar to "we can find".
I don't think they're equivalent. I think invention and creation aren't actually real. There is no "making" or "creating" when it comes to intellectual work.
All computer files are sequences of bits. All sequences of bits are integers. All integers already exist in the infinite set of natural numbers. I can even calculate how big those numbers are given their bit count.
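To make the bits-to-integer step concrete, here is a small sketch using Python's built-in `int.from_bytes`. The helper names are made up for illustration, and big-endian byte order is an arbitrary choice; any fixed convention gives a one-to-one mapping.

```python
def bytes_to_int(data: bytes) -> int:
    """Interpret a byte string as one big-endian natural number."""
    return int.from_bytes(data, "big")


def int_to_bytes(number: int, length: int) -> bytes:
    """Invert bytes_to_int; the length is needed to restore leading zero bytes."""
    return number.to_bytes(length, "big")


def max_decimal_digits(bit_count: int) -> int:
    """Decimal digits of the largest integer that fits in bit_count bits."""
    return len(str(2**bit_count - 1))


comment = b"This comment is a number."
number = bytes_to_int(comment)
print(number)
print(int_to_bytes(number, len(comment)) == comment)
print(max_decimal_digits(len(comment) * 8))
```

Note that the integer alone does not carry the file length, so two files that differ only in leading zero bytes map to the same number unless the length is stored alongside it.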
We are merely discovering numbers through convoluted mental and technological processes. All our mental exertions result in the discovery of a number. This comment is a number.
Yes, I mean exactly this type of insight. Basically taking a digital photo with a camera technically also just picks out the "address" of your current environment within the space of all images. Any 4K 2-hour-length feature film in a digital format is also just an address in the space of all possible videos. The director, the actors, the whole crew did all that work in order to select that point from the space of possibilities, they didn't "create" anything. That movie already existed.
Of course this is silly, but interesting nonetheless. And we routinely speak about such high-dimensional spaces in research and engineering. Or we can imagine optimization as traversing a pre-existing search space. It may be structured as a graph or perhaps a Euclidean space. And in that space we can imagine a loss surface, that sits there in peace all along, with its global minimum somewhere. And instead of "constructing" a solution, we are simply hiking in this space and trying to spot that valley. But this is a bit fictional. We never physically "instantiate" this surface. It's an imagined abstraction. In reality we just have a vector and some rules as to how we change that vector. But we can imagine those changes to be movements in an imagined space.
It's like the idea that the sculptor doesn't create the sculpture, the sculpture was there all along, he just had to remove the superfluous matter to reveal what was already there (i.e. the atoms belonging to the final sculpture).
The most interesting thing is kind of on the border, between these absurdly large spaces and the more manageable ones that are feasible to enumerate.
Another similar mindblow thing was when I forgot the password to a file that I encrypted. It's a fascinating thing that the bit pattern on the disk is functionally random now, and cracking it would take longer than the age of the universe. But if only I knew the password, it would only take just a second. There is a definite sequence of keystrokes I can execute to bring the universe in a state where the content will appear on my screen, it's so close, yet it's so-so far if you don't remember the password. Just a little difference in your brain state and it flips from trivial to hopeless.
PS, if you like thinking about such things, I recommend Meta-Math by Gregory Chaitin, it's very fun (providing an address VS constructing the thing is basically the gist of algorithmic information theory).
Yeah I agree with you.
> It's like the idea that the sculptor doesn't create the sculpture, the sculpture was there all along, he just had to remove the superfluous matter to reveal what was already there (i.e. the atoms belonging to the final sculpture).
I understand this argument but I have far more trouble applying this logic to real things. I'm not sure the same logic applies once the information is instantiated in the real world as a physical object. I haven't thought very deeply about it. I think the true sculpture exists only in the ideal world and the real world object is merely an approximation of it.
> Of course this is silly
It's an existential issue for me. At some point it became a political issue. I became a copyright abolitionist because of this insight. Copyright is logically reducible to monopolistic ownership of numbers. The sheer absurdity of it led me to reject the very idea of intellectual property as delusional nonsense.
I'm not sure the law has ever been concerned with logical reducibility. Context that can't easily be defined objectively has always been a part of legal systems, and arguably is a feature rather than a bug. Things like the "reasonable person" standard are intentionally fuzzy concepts that allow laws to exist without needing to define every possible permutation of human behavior up front. This obviously doesn't mean that you won't look at everything and decide to be an anarchist because of how convoluted it all is, but I don't think that being mathematically inconsistent is particularly unique to copyright in the legal system.
Exactly, it's a common failure mode for math/programming-minded people when encountering the law. But the law is not like a compiler, mechanically following some fully-specified set of rules.
The legal system is rather the spiritual successor of the original "system" where a wise Solomon-like elder would adjudicate the issue based on their best judgment and intuition and customs, ideally seeking peace and social satisfaction and future harmony. Codified law channels this into some more pre-shaped form, but the fuel of the legal system is still the human judgment and common sense at the core. Often the law basically just prompts and nudges the judgment of the jurors or judge in a certain direction, but it can't account for all corner cases. The nerd mind asks: ok ok, but what if X, where do you draw the sharp line between X and Y? It doesn't matter. If it comes up, a court will decide it based on all available common sense and the implicit values of the culture.
In the cases where someone seemingly gets away with "rules-lawyering", it's not purely their genius logic-brain that wins; there is some kind of slanted playing field that's not really available to you. Of course, the line between "annoying rules-lawyering based on literal interpretation of technicalities that obviously nobody intended to be interpreted so" and something that was not anticipated initially but does fit within the rules is blurry. Deciding which side a case falls on is itself based on judgment and intuition. In life, sometimes coming up with a "technically works" thing is rewarded and lauded (math proofs, pathological counterexamples, cracking an encryption library via side-channel attacks), other times you get an eye-roll because that's obviously cheating and wasn't meant (e.g. courts of law and fun at parties).
I'm close to you on that opinion, but there's another factor: life and its sustenance. There are a lot of mechanisms in the body to ensure that life continues, including pain and desire. But the fact is that the resources that sustain life are finite. There are a lot of proxies for the act of acquiring such resources, and laws like copyright are the legal framework for these proxies.
It's basically creating value out of nowhere in lieu of resources that are truly valuable, but inconvenient to trade directly. But then, like a metric that got corrupted (I forgot the name of the law for that), there are others who are trying to game the system (and succeeding) so that they can maximize their share.
Copyright is not "ownership of numbers". "Intellectual property" is a misnomer. Copyright is an instrumental tool to achieve specific socially desirable things, namely the flourishing of scientific and artistic activity. It's a relatively modern creation, born of Enlightenment-style principles in the 18th century. If it were still used according to that spirit, we'd have fewer problems.
Reminds me of the DeCSS t-shirts from back in the day…
I admit thinking this way is tempting, but in your model the number represents some kind of language, whether human-readable or machine-readable. If we accept the number is a non-lossy encoding of some language, we reach an equivalency stating there is no creating, just discovering language "through convoluted mental and technological processes". But can we really equate language and knowledge? I believe Gödel proved that we cannot, in the sense that there is no "perfect" way to encode knowledge in a system of consistent axioms. Ergo, no matter how eloquently you describe your invention of "the wheel", it is by its nature incomplete and imperfect. Some part of the knowledge will always be tacit.
> Some part of the knowledge will always be tacit
See also https://en.wikipedia.org/wiki/What_the_Tortoise_Said_to_Achi...
This conflates mathematical existence with actual instantiation. A 2 GB integer might be definable, but until someone encodes a particular arrangement of bits and gives it context, it doesn’t exist in any practical sense. We don’t treat all future novels as "already written" just because their ASCII codes can be mapped to integers.
I said all novels already exist. That's different from claiming all novels have already been written.
The claim is that humans are not "creators" but generators, very much in the random number generator sense. We are interesting number generators.
How do you find a nice SHA1 hash? How do you do keyword search in this list? Search and discovery of quality are unsolved scientific challenges. Fascinating stuff.
At our university lab we've been working on this for 25 years. Building a search engine is the easy part. Keeping a federated server with a billion users running is unsolved. Creating a fully -serverless- decentralised search engine is possible, but you also need a self-funding economy. Seems we're one of the few labs worldwide to still make actual operational prototypes of this stuff. More shameless self promotion:
"SwarmSearch: Decentralized Search Engine with Self-Funding Economy" [0]
Really handy to have a search engine to search this webpage with 45,671,926,166,590,716,193,865,151,022,383,844,364,247,891,968 pages and the rest of the web (no spyware, no tracking).
[0] https://arxiv.org/abs/2505.07452
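As a quick sanity check on that page count: it is exactly 2^155. The sketch below assumes the site enumerates every possible 160-bit SHA-1 infohash, which the comment above does not state; under that assumption the count works out to 32 infohashes per page.

```python
# Page count quoted in the comment above.
PAGES = 45_671_926_166_590_716_193_865_151_022_383_844_364_247_891_968

# Assumption: one entry per possible 160-bit SHA-1 infohash.
INFOHASH_BITS = 160

print(PAGES == 2**155)
print(2**INFOHASH_BITS // PAGES)
```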
If you’re interested in mass market adoption rather than just proving the theory, you will need to change the name. “LimeWire” is fun. “SwarmSearch” sounds like a biblical plague.
I would say that that’s a valid _model_ we can use to describe creation, much like how maths is a model we use to describe the universe. However, whether maths IS the universe or creation IS discovery are more of a philosophical question, possibly an unanswerable one, that people will have many varying opinions on.
And that’s without me asking you to define “real”, which would be another rabbit hole.
Everyone is aware of this. Sites like this aren't created to be useful. They are created to be an amusement, a joke.
To your first bullet, I believe this is one of the central points of the original Borges story :)
I think Library of Babel by Borges is a static manifestation of Turing complete behaviour via the fact that some L-systems are Turing complete. Or, put another way: where in the Library of Babel does the real Hamlet reside? If we consider finding and replacing names with other names, is it still a Hamlet? And if we bring the full force of edit operations and do these in a reversible manner, then where does the actual Hamlet reside? An equivalence class of Hamlet?
> If you had an oracle that could tell you how to get to the book you need, the navigation instructions to get to the book will be at least as long as the book, on average.
This isn't quite true. Natural language text compresses extremely well and you would only need length equivalent to the compressed form, not the original form. And if you wanted to go further, you could use a mapping where extremely short strings map to known popular books and only unknown works have longer encodings.
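Both sides of this exchange can be seen in a few lines. The sketch below treats a book's zlib-compressed form as its "address", which is an illustrative assumption and not how any actual Library of Babel site works. It compares a passage of English with the same number of pseudo-random bytes, standing in for a typical shelf of gibberish.

```python
import random
import zlib

# Ordinary English, quoted from this thread, standing in for a readable book.
english = (
    "If you had an oracle that could tell you how to get to the book you need, "
    "the navigation instructions to get to the book will be at least as long as "
    "the book, on average. Natural language text compresses extremely well and "
    "you would only need length equivalent to the compressed form, not the "
    "original form. And if you wanted to go further, you could use a mapping "
    "where extremely short strings map to known popular books and only unknown "
    "works have longer encodings."
).encode("ascii")

# Same length, but pseudo-random bytes: what most books in the library look like.
rng = random.Random(0)
gibberish = bytes(rng.randrange(256) for _ in range(len(english)))


def address_length(book: bytes) -> int:
    """Bytes needed to name a book if its address is its zlib-compressed text."""
    return len(zlib.compress(book, 9))


for label, book in (("english", english), ("gibberish", gibberish)):
    print(label, len(book), "->", address_length(book))
```

The English passage gets a shorter address, while the random one does not shrink at all. That is the counting argument behind "on average": there are fewer short strings than long ones, so short addresses for some books are paid for with longer ones elsewhere.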
I suppose this would work if the library was arranged such that comprehensible books were closer to the "origin". The workings of the "real" library of babel are supposed to be more inscrutable though.
But if I built one, it would totally work that way.
Kolmogorov’s library
I wonder if there is some way to create a latent-space Library of Babel in which you only find incoherent gibberish with extremely long keys, with the shortest ones pointing specifically to the most common/likely strings of text, in manageable computational complexity.
Reproducing the text of a book in the library is a synonym for identifying the book. So this is really called "text compression", which is a well-studied field.
In a library of all possible strings, this is just text compression (as the other comment observes). But in a finite library it gets even simpler, in a cool way! We can treat each text as a unique symbol and use an entropy encoding (e.g. Huffman) to assign a length-optimized key to each one based on its likelihood (e.g. as estimated by an LLM). Building the library is something like O(n log n), which isn't terrible. But adding new texts would change the IDs of existing texts (which is annoying). There might be a good way to reserve space for future entries probabilistically? Out of my depth at this point!
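Here is a minimal sketch of the finite-library idea, assuming we already have a likelihood weight per text. The titles and weights are invented, and `huffman_keys` is a hypothetical helper name, not anything from an existing project. It builds a Huffman code so that likelier texts get shorter binary keys.

```python
import heapq
from itertools import count


def huffman_keys(weights: dict[str, float]) -> dict[str, str]:
    """Assign each title a prefix-free binary key; heavier titles get shorter keys."""
    if not weights:
        return {}
    if len(weights) == 1:
        # A single entry still needs a non-empty key.
        return {next(iter(weights)): "0"}

    tiebreak = count()  # unique counter so the heap never compares dicts
    heap = [(w, next(tiebreak), {title: ""}) for title, w in weights.items()]
    heapq.heapify(heap)

    while len(heap) > 1:
        # Merge the two lightest groups, prefixing their keys with 0 and 1.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {t: "0" + k for t, k in left.items()}
        merged.update({t: "1" + k for t, k in right.items()})
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))

    return heap[0][2]


# Invented weights, standing in for likelihoods from some model.
library = {"Hamlet": 50, "Don Quixote": 25, "Moby-Dick": 15, "obscure pamphlet": 10}
for title, key in sorted(huffman_keys(library).items(), key=lambda kv: len(kv[1])):
    print(f"{key:>4}  {title}")
```

Rewriting every key on each merge keeps the code short but costs more than the O(n log n) of a tree-based construction. It also shows the annoyance mentioned above: changing the weights or adding a title can reshuffle existing keys.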
That's arguably just a regular library :)
Another way of looking at it is that the library of Babel would be less useful than an equivalent quantity of blank paper. For example, you could use the blank paper to print books in English instead of gibberish. Multiple copies of those books, even.
I am reminded of this SMBC comic
https://www.smbc-comics.com/comic/the-library-of-heaven
Thank you captain obvious.
At your service.
> There is no validation that an infohash corresponds to a real torrent—any client can announce anything. Many crawlers and indexers continuously pick random or sequential infohashes and announce themselves so they can later detect other announcers, and malicious clients or poorly written bots can spam the network with anything they like.
There are also valid clients for completely unrelated protocols using the BitTorrent DHT to find each other.
Which? I'm always fascinated by the use of public p2p nets to serve other protocols. The first complete standalone program I wrote was a gnutella p2p client.
iroh does too: https://www.iroh.computer/docs/concepts/discovery
I have the same fascination. You might find https://github.com/dmotz/trystero quite interesting - it's fun to play around with, also can use torrent DHT for discovery.
https://github.com/pubky/pkarr is another one
Very cool, reminds me of the library of Babel (of which you also made a version! [1]).
I made something similar a while ago, the Hdd of Babel [2], which contains all possible files(*) , and wrote down some thoughts on it [3].
I really like how it makes us think about the nature of information.
[1] https://libraryofbabel.app/
[2] https://mkaandorp.github.io/hdd-of-babel/
[3] https://dev.to/mkaandorp/this-website-contains-pictures-of-y...
I think this would be an even better joke if the site was a setup for plausible deniability for piracy.
"I didn't share that! It was on infohash.lol first!"
For a more practical version (containing only infohashes that are observed on the dht) there is bitmagnet [1]. No public instances though, you have to self-host
1: https://github.com/bitmagnet-io/bitmagnet
how to go straight to jail 101
You are only downloading metadata, and csam content is filtered. But yes, I would also rate it as a legally risky activity
> csam content is filtered
Filtered how? By some keywords I don't want to know? What about encrypted zips of CSAM? There's no way to filter that in reality.
If you want to learn more about why, and you either speak German or can handle YouTube's auto-translate, I recommend this documentary on the matter [0]. The pedo criminals are using scene methods to share their illegal content.
[0] https://www.youtube.com/watch?v=Ndk0nfppc_k
Yes, a simple keyword list in the classifier, matched on the torrent name and file names. Easy enough to find in the source if you look for it. That filter won't help against people uploading CSAM as documents.7z. But any filter that would want to do something against that would require downloading the content, which would be even more illegal (in addition to being wildly impractical)
Would it matter if it's metadata-only until you download?
why not just exclude encrypted zips?
bitmagnet only has the info you get by looking up the infohash in the dht, which is basically the same info that's stored in a .torrent file: a name, a list of files with offsets and paths, and a bunch of block hashes. That's not a lot to go on, and e.g. doesn't tell you if the zip is encrypted
I guess you could filter all torrents that include just zips/rars/7zips. That would exclude a lot of harmless content. Probably too much harmless content to make it a default, but if you only care about hollywood releases it would be a useful filter
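Roughly what such a filter could look like, working only from the metadata (name plus file paths). This is a hypothetical sketch, not bitmagnet's actual code: `TorrentMeta`, `is_archive_only`, and the suffix list are all made up for illustration.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

# Assumed suffix list; multi-part archives such as .r00 or .001 are not covered.
ARCHIVE_SUFFIXES = {".zip", ".rar", ".7z"}


@dataclass
class TorrentMeta:
    """Hypothetical stand-in for the metadata a crawler has: a name and file paths."""
    name: str
    file_paths: list[str]


def is_archive_only(meta: TorrentMeta) -> bool:
    """True when every file in the torrent has an archive suffix."""
    if not meta.file_paths:
        return False
    return all(
        PurePosixPath(path).suffix.lower() in ARCHIVE_SUFFIXES
        for path in meta.file_paths
    )


candidates = [
    TorrentMeta("documents", ["documents.7z"]),
    TorrentMeta("Some.Movie.2020", ["Some.Movie.2020/movie.mkv", "Some.Movie.2020/info.nfo"]),
]
kept = [m.name for m in candidates if not is_archive_only(m)]
print(kept)
```

As noted above, this only looks at file names, so it drops every archive-only torrent whether or not the archive is encrypted.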
If there was a public list of hashes of (8/18KiB blocks of) CSAM content that would be useful for a filter, but I don't think such a thing exists
> If there was a public list of hashes of (8/18KiB blocks of) CSAM content that would be useful for a filter, but I don't think such a thing exists
But wouldn't that just be a list of CSAM to look up?
Does running an indexer and crawler help make the content available to others, or why would this be legally risky? Why would anyone care about what kind of Docker container I run on my home server?
Love this idea of generating pages based on some strictly defined enumeration. Reminds me of https://everyuuid.com/
Me too. That's listed as an inspiration on the index page!
Or every bitcoin public and private address.
https://keys.lol
Does anybody know what they are using in the browser to perform DHT?
In theory this could be used to share torrent links by a different reference (ideally you could also add an anchor too). Somebody else could have a page that takes keywords and points you to pages hosted on the site.
The page is making a WebSocket connection to the server and getting the peer info through the WebSocket connection. I think the magic happens on the server.
This is a sample of the client-side code I found handling that: https://infohash.lol/_next/static/chunks/pages/p/%5Bpage%5D-...
https://www.npmjs.com/package/bittorrent-dht is used on the server.
DHT crawlers/indexers already exist to perform that function; they crawl and store infohashes (+ metadata when they receive it) and allow users to search that metadata to return relevant infohashes
> Many crawlers and indexers continuously pick random or sequential infohashes and announce themselves so they can later detect other announcers
I can't follow the logic here. How does this detect other announcers?
By announcing itself, the indexer makes itself more likely to be handed out as a peer to anyone else interested in that infohash. Every connection attempt it subsequently receives is evidence of another peer announcing or joining that torrent. In effect, it "baits" peers into revealing themselves
The way I understand it, these extraneous infohashes are functional honeytokens.
https://en.wikipedia.org/wiki/Honeytoken
> In the field of computer security, honeytokens are honeypots that are not computer systems. Their value lies not in their use, but in their abuse.
So they are basically detecting bots that indiscriminately try to download any detected infohash, right?
That's not detecting "announcers", but maybe more like detecting "indexers".
> That's not detecting "announcers", but maybe more like detecting "indexers".
I think you’re correct. The secondary, freebooting indexers add their own tracker(s) to the original prefilled tracker list after the private torrent has been created, then reupload it, usually as a public torrent. Sometimes they even remove the original private trackers so that the torrent doesn’t phone home and tell on them.
I’m happy to be corrected, but private trackers typically bind the downloading IP of the torrent to the announcing tracker to validate legitimate clients. Private trackers don’t consider any extra trackers (announcers in this context) as valid or authorized. I have heard that modded BitTorrent clients can intentionally misreport upload stats to game your quota, as many private trackers/torrent sites enforce a minimum ratio of 1.0 or higher.
I’ve heard of ways that folks with legitimate access to the private torrent tracker and torrents clone the IPs of other clients and then use a secondary torrent client to request blocks, bypassing the tracker entirely and not reporting any downloads (or uploads, for that matter), so the quota of the first legit client is not affected positively or negatively.
I wonder how many times on average you'd need to click the "random" button in order to stumble on a page that contains a real torrent.
So there is almost zero chance that opening up a particular page is going to land on an actual torrent.
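A back-of-the-envelope version of that estimate, with loud assumptions: v1 infohashes are 160-bit SHA-1 values, and both the number of live torrents (10^9) and the number of hashes shown per page (32) are guesses for illustration, not figures from the site.

```python
INFOHASH_BITS = 160  # BitTorrent v1 infohashes are SHA-1 digests


def expected_clicks(live_torrents: int, hashes_per_page: int) -> float:
    """Expected number of uniformly random pages until one lists a live torrent.

    Uses p = live * per_page / 2**160, which is accurate while p is tiny.
    """
    total = 2 ** INFOHASH_BITS
    p_page = min(1.0, live_torrents * hashes_per_page / total)
    if p_page == 0:
        return float("inf")
    return 1 / p_page  # mean of a geometric distribution


# Both arguments are assumptions chosen for illustration.
print(f"{expected_clicks(10**9, 32):.2e} clicks on average")
```

Even with generous guesses the answer stays astronomically large, because 2^160 dwarfs any plausible count of real torrents.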
shades of my younger days on kazaa, excitedly downloading a file called 'hacking-tool-every-possible-ip-address.txt'
Is this legal? I’m under the impression that publishing infohashes of copyrighted content is illegal under the DMCA?
Assuming the web server does not actually store and serve pages in a conventional sense, but rather acts like an application that renders the results of parsing and processing the user's input, I wonder what the legal implications are.
I can generate a Google link with an infohash in the same fashion: https://www.google.com/search?q=1548262051907755713575797913...
It's probably as illegal as any other random number generator.
I wonder how hosting a torrent is different to google showing a link to a pirated movie, both are just holding data that tells you where to find the content, not the content itself
That was The Pirate Bay's defense and... they're still around.
neither "hosts" the content. they both just point to the destination with the content.
I think Google is expected to abide by DMCA takedowns in such cases, but IANAL. My understanding is that even an indirect reference (such as a link or infohash) is a DMCA violation.
it is. same as with URLs the infringement is the actual copyrighted content not the pointing to it.
the infohash isn't copyrighted, so it's not illegal information in and of itself. serving the infohash isn't serving the torrent, and serving the torrent is also not serving copyrighted material. I believe that downloading is still illegal absent a fair use exemption but it's rarely prosecuted because you have to prove the absence of the exemption. It's uploading copyrighted content that's actually illegal and also easy to prosecute, so it's seeders that usually get bopped.
The site doesn't publish any, except the two legal torrents that are on the front page. Any others you have to either request specifically, or are simply randomly generated.