Mike supports $SPORT team the Compressors. He's so sure they are unbeatable that he accepts 50:1 bets against them. Patrick bets 100$ that they won't win this year's championship; Mike accepts. Later, the Compressors announce financial troubles and can't pay the fee to enter the championship, which is then won by another team. Patrick reclaims his 5000$. Mike refuses to pay saying that the Compressors have not been beaten, and the championship result does not disprove his claim that the Compressors are unbeatable, which was the spirit of the bet since the beginning. Mike offers to refund the 100$. Patrick holds that the terms of the bet were met and he deserves the 5000$.
I have a fundamental problem with the Hutter prize stating that intelligence is related to compression & then sponsoring a prize for lossless compression. Intelligence is related to lossy compression. Lossless is mainly a mechanistic act.
Marcus Hutter is free to sponsor a prize according to his definition of intelligence.
You are free to sponsor a prize according to your intelligence definition involving a lossy compression.
Are you trying to imply that people need to put up big sums of money when they want to critique someone else's definitions be taken seriously? If yes, I think that's ridiculous. If no, I can't figure out the point of your comment.
Next, say I find some predictive model of the data M, where M is composed of N bytes. Furthermore, let us say that the entropy of the dataset under the model is H bytes.
Then, if N + H < D, my model compresses the data.
It doesn't matter if the model is deterministic or probabilistic: a probabilistic model can be used to (losslessly) compress a dataset with entropy coding.
One more argument for compression being equivalent to intelligence: Across many fields of statistical machine learning, there are generalization bounds which have the form:
test error <= train error + model complexity
That is, we don't expect to do any worse on new (test) data, than the sum of the train error and the model complexity (smallest compressed size). Notice that if you interpret the train error as the entropy of the data (i.e. error under a cross entropy loss), then the models which satisfy the statistical generalization bound correspond to those which best compress the data. In other words: the model which produces the shortest description of the data is the one which is expected to generalize best.
> It doesn't matter if the model is deterministic or probabilistic: a probabilistic model can be used to (losslessly) compress a dataset with entropy coding.
But if you can choose to lose information you can obviously achieve a higher compression score. That's literally what optical & auditory compression exploits. Indeed, we know people generally don't memorize the entire Wikipedia article. Rather they convert what they learn into some internally consistent story that they can then recite at any time & each time they recite it it's even differently worded (maybe memorizing some facts that help solidify the story).
Again, I have no problem with compression and decompression being equated to intelligence provided both are allowed to be lossy (or at least one facet of intelligence). That's because you get to inject structure into the stored representation that may not otherwise exist in the original data and you get to choose how to hydrate that representation. That's why LZMA isn't "more intelligent" than ZIP - the algorithm itself is "smarter" at compression but you're not getting to AGI by working on a better LZMA.
It's also why H264 and MP3 aren't intelligent either. While compression is lossy decompression is deterministic. That's why we can characterize LLMs as "more intelligent" than LZMA even though LZMA compresses losslessly better than LLMs.
I agree with you in spirit. I just thought you might be interested in some of the technical details regarding the relationship between compression and generalization!
I'll have a paper out next week which makes your point precise, using the language of algorithmic rate--distortion theory (lossy compression applied to algorithmic information theory + neural nets).
I think another way of understanding this is through the "Value Equivalence Principle", which points out that if we are learning a model of our environment, we don't want to model everything in full detail, we only want to model things which affect our value function, which determines how we will act. The value function, in this sense, implies a distortion function that we can define lossy compression relative to.
This seems to assume that there is a tractable way to encode H efficiently, but this seems very difficult given a model that is focused on understanding the content. Ex: I can easily write a program that can do basic arithmetic, but given say a bitmap scan of elementary school math materials, such a program gives me no easy way to compress that; rather something generic like PNG (that does not know or understand the material) will far outperform.
Great point. This points to the related issue: what do we want to compress? Do we want to compress "the answer", here the arithmetic expression's solution, or do we want to compress the image?
You can formalize this with rate--distortion theory, by defining a distortion function that says what your objective is. That implies a well-defined complexity relative to that distortion function.
Okay to be clear, I've written a paper on exactly this topic which will be announced in a week or so. So you won't find anything on the subject yet, haha. But I use almost exactly this example.
>Okay to be clear, I've written a paper on exactly this topic which will be announced in a week or so. So you won't find anything on the subject yet, haha. But I use almost exactly this example.
I would use Floating Point arithmetic as the example/analogy: one trades off accuracy/precision for exactness/correct-ness when in the extremities. Answers near the more granular representations will be only be able to represented by their nearest value. If this is forsee-able, the floating point implementation can be adjusted to change where the floating point's "extremities'" are.
I used the above analogy and the following when articulating the magnitude of near-lossless-ness that large LLM's have managed to attain, especially when all of the humanities corpus is compressed into a USB flash drive; the Kolmogorov complexity re-mitted/captured is similar to a master-piece like the Mona Lisa having to be described in X many brush-strokes to achieve Y amount of fidelity.
"If it were a fact, it wouldn't be called intelligence."
I think the other way to read it is that you're fundamentally going to have to choose (by your freedom to control your algorithm's design) which of the maps from (2^|x|) -> (2^|compress(x)|) are the ones that actually end up getting no compression (ie |compress(x)| actually > |x| - bc of pigeonhole principle) and which ones are the ones that do get to be called compressed. So you do have an implied kind of lossiness w.r.t. to what parts of the input domain are actually compressed.
Isn’t the intelligence shown by compressing lossless the scheme you use? Applying the algorithm is the easy part, the proof of intelligence is inventing the algorithm which compresses.
Hutter put up the prize as a way of getting people to approximate his AIXI algorithm, which he already considers to be the pinnacle of artificially intelligence. That's also why lossy compression isn't interesting. It'd be an approximation of an approximation and not particularly useful for hutter's purpose.
> his AIXI algorithm, which he already considers to be the pinnacle of artificially intelligence
This paper from Hutter in 2015 https://arxiv.org/abs/1510.04931 considers AIXI to be "subjective", since the choice of Universal Turing Machine is left unspecified:
> We show that Legg-Hutter intelligence and thus balanced Pareto optimality is entirely subjective, and that every policy is Pareto optimal in the class of all computable environments. This undermines all existing optimality properties for AIXI. While it may still serve as a gold standard for AI, our results imply that AIXI is a relative theory, dependent on the choice of the UTM.
Sorry, if you're trying to hustle people by xstging $100 per try, don't catch the sleight of hand in the "multiple files" question, and accept, you were beaten at your own game, fair and square.
Not in any significant way. The decompressor could be changed to require you to feed the files into it in the correct order or expect some other sorting.
What you're saying is like saying that you encoded info in filenames because decompress.sh expects a file "compressed.dat" to exist. It's not describing any meaningful part of the scheme.
The filenames contain information that you need in some way for the scheme to work.
You are combining different parts and inserting a missing byte every time you combine the files. You need to combine the parts in the correct order, and the order is part of the information that makes this work.
If the ordering isn't coming from filenames, it needs to come from somewhere else.
You could do the same spitting trick but only split at progressively increasing file lengths at the character '5'. The "compression" would be worse, so you'd need a larger starting file, but you could still satisfy the requirements this way and be independent of the filenames. The decompressor would just sort the files by increasing length before merging.
But if you change them without making them sort differently, everything is fine. He depends on the order, not the filenames. You could even remove the filenames entirely, as long as you patch the code to account for such a strange environment.
This guy clearly failed because he didn't actually do any compression, he just ab-used the filesystem to store parts of the data and then tried to argue that metadata was not data...
He clearly failed at doing any actual compression, but he did not fail at the challenge. He satisfied every single condition.
He obviously knows that. He knows exactly what compression is and that his solution is not it. He showed (successfully) that meeting the challenge as it was worded did not require compression, which the setter of the challenge didn't realize.
If Mike didn't want multiple files, he should have said no to multiple files. The fact that he wanted $100 per try shows that he just got outhustled at his own hustle. He should have been more particular about the rules, and upheld the ones he himself set.
That argument was made, and addressed in the emails. How are you defining compression? Actual compassion doesn't seem to have been a requirement in the challenge.
I didn’t quite get the method used to „compress“ the data from the article, maybe this rephrasing helps someone:
You basically split the file every time you encounter a specific character, and your compressor just combines all files it finds with the character you split by. If you split at every „X“ Char which might occur 1000 times in the file, the compressor only needs a small script which joins all files and puts an „X“ between them, which is less than 1000 bytes. The „trick“ is storing the location of the Xs you removed in the file sizes of the individual files.
Both are needed. If you want to transmit this solution from one computer to another you need to store the size of each file (or insert a fancy delimiter that takes even more space).
I wonder if it would have been possible to win the challenge legitimately?
If a randomly-generated file happened to contain some redundancy through sheer chance, you could hand-craft a compressor to take advantage of it. This compressor would not work in general for random data, but it could work for this one particular case.
It's a bet worth taking, because the payoff, 50:1 ($5,000 to $100), is pretty good. Play the game 50 times and you might get a file you could compress.
The challenge, then, would be for the person offering the bet to generate a really random file that contained no such redundancy. That might not be easy.
I think you can make some argument about why this isn't possible at 50:1 odds.
A plausible "decompressor" is at least, say, 30 or 100 bytes, so the random file needs to have 30 bytes less entropy than you expected, which happens with probability X where X << 1/50. Sum over the whole domain of reasonable decompressors, and you still don't get there.
This argument could do with more rigor, but I think it's correct. Give me 100 million to 1 odds, though, and I'll take my chances trying to brute force a compressor.
This is actually an extremely interesting question. 'Weak' files that are more easily compressable than others certainly exist, but with low probability.
For example, the all-zeros file is a member of the set of all random 3 megabyte files, and it would certainly be possible to compress that, if by great good fortune you were lucky enough to receive it - albeit something that is unlikely to ever happen in the possible lifetime of the universe, given current physical theories.
Is it possible to quantify this idea of a 'weak' file more accurately?
Yes, you can think of it in terms of (WLOG think of any uniquely-decodable code) prefix-free codes. They're uniquely decodable - for things that are not uniquely decodable, that implies that you could put overlapping codes over that symbol. If you make a matrix like this where the rows are the bitstrings of length b and columns are individual bits:
then you have 2^b rows. Suppose you look at the sub-bitstrings of length k, k < b. They all appear the same number of times, if you count them wherever they appear at any position in across the entire matrix.
However, you also know, for sure, that, if a prefix-free code appears in a particular row, that means since it's impossible to overlap with anything else in that row at its span. What does that imply? That the prefix-free codes have a greater 'occupancy percentage' of a single row than all other sub-bitstrings. That means that you must find fewer of them, on average, inside of a single row.
But since we know that all sub-bitstrings appear the same number of times throughout the entire matrix, what else can we deduce? That the prefix-free codes must appear /over more rows / on average, if they cannot appear as many times while looking at bit positions /along the columns/. That means they will occur as a sub-pattern in full-bitstrings more often than typical random sub-patterns.
So weakness here corresponds to the presence of patterns (prefix-free codes) that are:
- non-overlapping within bitstrings
- widely distributed across bitstrings
- due to their wide distribution, there's a higher chance of encountering these patterns in any given random file
- therefore, weak files are more compressible because they contain widely-distributed, non-overlapping patterns that compression algorithms can take advantage of
One thing you can do, as the other commenter pointed out, is consider entropy of the file.
However, this restriction is too much for the purposes of this challenge. We don’t actually need a file with low entropy, in fact I claim that a weak file exists for files with entropy 8 (the maximum entropy value) - epsilon for each epsilon > 0.
What we actually need is a sufficiently large chunk in a file to have low entropy. The largeness is in absolute terms, not relative terms.
A very simple file would be taking a very large file with maximum entropy and adding 200 0’s to the end. This would not decrease the entropy of the file much, but it gives way to a compression algorithm that should be able to save ~100 bytes
Note that if this large chunk occurs in the middle of the file, then you will need extra space to encode that position. For example, a random bit string of length 2^n is decently likely to have a run of n zeroes. But this doesn’t help you because you need n bits just to encode where that run happens.
What if we didn't even attempt to compress the file? Theoretically, there is a random number generator + seed combination that would output the same bytes as the source file.
There is a polynomial expression that will match the file.
I wonder if expressing it would be compressible to a lower size than original file?
[edit] I’m wrong - not all sequences can be expressed with a polynomial.
This reminds me of a data compression scheme I came up with once:
Treat an n bit file as a polynomial over the finite field with characteristic 2. Now, there are some irreducible polynomials in this field, but many polynomials have factors of x and of (x+1). Factor the polynomial into P(x)x^n (x+1)^m. and just collect these terms, storing only P, n, and m.
Actually this would work perfectly if you knew it was generated in a single pass by a known random number generator and you had tons of time to brute force it.
If the file were generated by a natural source of entropy then forget it.
Or even if modified in a trivial way like adding 1 to every byte.
Somewhere (discussed on HN) someone devised a "better-than-perfect" compressor. Most inputs get compressed (smaller than input), except for one input that does not. That one input is cryptographically impossible to find - or something along those lines.
Unfortunately I can't find the article I'm describing here, maybe someone else can? It was a long time ago so I might be misrepresenting it slightly.
> likely files (eg. ASCII, obvious patterns, etc) become smaller
Likely files for a real human workload are like that, but if "most inputs" is talking about the set of all possible files, then that's a whole different ball game and "most inputs" will compress very badly.
> unlikely files become bigger
Yes, but when compressors can't find useful patterns they generally only increase size by a small fraction of a percent. There aren't files that get significantly bigger.
You could design the opposite, a system where ultra-rare files get compressed a lot but most don't compress.
But as stated, what you said is not possible. To compress by a single byte, only 1/256 files can be compressible, or you'd have multiple different inputs compressing down to the same output.
There's another interesting loophole to these general compression challenges. A universal Turing machine will necessarily compress some number of strings (despite the fact that almost all strings are incompressible). The set of compressible strings varies depending on the choice of UTM, and if your UTM if fixed, you're out of luck for random data. But if the UTM is unspecified, then there exist an infinite number of UTMs that will compress any specific string.
To some extent, this resembles the approach of "hiding the data in the decompressor". But the key difference here is that you can make it less obvious by selecting a particular programming language capable of universal computation, and it is the choice of language that encodes the missing data. For example, suppose we have ~17k programming languages to choose from—the language selection itself encodes about 14 bits (log2(17000)) of information.
If there are m bits of truly random data to compress and n choices of programming languages capable of universal computation, then as n/m approaches infinity, the probability of at least one language being capable of compressing an arbitrary string approaches 1. This ratio is likely unrealistically large for any amount of random data more than a few bits in length.
There's also the additional caveat that we're assuming the set of compressible strings is algorithmically independent for each choice of UTM. This certainly isn't the case. The invariance theorem states that ∀x|K_U(x) - K_V(x)| < c for UTMs U and V, where K is Kolmogorov complexity and c is a constant that depends only on U and V. So in our case, c is effectively the size of the largest minimal transpiler between any two programming languages that we have to choose from, and I'd imagine c is quite small.
Oh, and this all requires computing the Kolmogorov complexity of a string of random data. Which we can't, because that's uncomputable.
Nevertheless it's an interesting thought experiment. I'm curious what the smallest value of m is such that we could realistically compress a random string of length 2^m given the available programming languages out there. Unfortunately, I imagine it's probably like 6 bits, and no one is going to give you an award for compressing 6 bits of random data.
The problem with the UTM argument designed for a specific string is that the UTM size grows without bound. You also don’t need a UTM, which will be much larger than a simple TM or finite automaton. And now we’re back to the original problem: designing a machine capable of compressing the specific string and with smaller description.
The issue with the invariance theorem you point out always bugged me.
Let s be an algorithmically random string relative to UTM A. Is it the case that there exists some pathological UTM S, such that K(s|S) (the Kolmogorov complexity of s relative to S) is arbitrarily small? I.e. the blank print statement of S produces s. And there always exists such an S for any s?
Is there some way of defining a meta-complexity measure, the complexity of some UTM without a reference UTM? It seems intuitive that although some pathological UTM might exist that can "compress" whichever string you have, its construction appears very unnatural. Is there some way of formalizing this "naturalness"?
I would have expected it to be possible to compress a single arbitrary random file with a small program. I would have thought an RNG could generate a file with some weakness that allows you to compress it, although said compressor would be worse at other inputs.
Likewise this challenge would have been stronger if the requirement was to compress two provided arbitrary files :P
I thought the same thing. Surely a random binary sequence will have some (small) repeating sequences? So as long as we can find them and (efficiently) mark their locations, we can have some small size reduction.
Which is basically what he’s done here. He’s cheating by marking the location of the repeating sequence using unique files, rather than some other actually more efficient location system.
This is a fun discussion! I think it helps to make compression feel more approachable for armchair enthusiasts.
This is hilarious, but Mike’s behavior boils my blood. To switch the rules when fairly beaten is just so scummy!
Of course Patrick’s solution occurred to me immediately once he started talking about splitting up the file. :)
Then I got to wondering if it would be possible to compress a large enough file and save a few bytes, in the spirit of what Mike actually wanted. For instance, if you tried encrypting with a bunch of random keys, you’ll find some such that the result ends with n 0s (say), so then your result could be something like encrypted + key + n. Typically you’d expect to find that the length of the encryption key would be roughly the length of the number of 0s, so you wouldn’t actually net any savings. But I think if you searched long enough, you might “get lucky” and find a key that generated more 0s then the length of the key, so you could theoretically win the challenge. Of course you’d have to somehow embed a program which could decrypt and reassemble everything in a tiny amount of space, but it still doesn’t seem impossible like Mike seems to suggest. Perhaps just requires an intractable amount of computing power. :)
I’m not sure if any of that made sense. I think I’m using the wrong terms for things.
I remember reading from a previous thread how yes you can do something like that (or lots of other techniques) but they will only work on a small percentage of files.
That sometimes the smallest decompressor (including the key, in your example) will result in overall less space, but usually it will result in more (because the smallest key that generates 10 zeros happens to be 14 digits long, etc.).
And so ultimately the more interesting question becomes about those probabilities, and therefore what is the right entry price for the bet to be even?
Story time. In our Data Structures and Algorithms class during my undergrad, we had a similar challenge posed to us by an associate professor who was quite proud of the fact that it had never been beaten.
The sophistication of my compression understanding at the time extended to lempel-ziv and huffman and well... that's about it. So I was like, “Okay, I’m not going to create the world’s best compression method on a whim.” So I started thinking: if I can’t compress the file, maybe I can just fetch it from somewhere else.
The first thing I tried was writing a program that searches online by uploading the file onto a public FTP. From there, I could fetch it. Simple enough, right?
Well, here’s the issue with that: there’s no guarantee I’d even have internet connectivity. That’s probably the first thing he’d block.
So, he gives us the file. I think it was something like 10 MB, and when I looked through it, the entropy was high. Like, really high. Standard compression tools like 7Z or rar wanted NOTHING to do with it.
Then I started thinking again. Maybe this guy’s running all the submissions back-to-back. He’s probably processing them in sequence because he’s got tons of entries coming in. So I thought: if I can’t download the file, maybe there's a chance it's already on the machine from somebody's previous simulation - I can find it somewhere on the hard drive. My program might not be a great decompression system, but you know what it is? A fantastic file search.
So, I threw together a program that tree searched the entire disk with a simple checksum to verify I’d found the exact right file. If it matched, I’d just copy the file into the current directory and say, “I’m done.” - though if it found the file too quickly it would continue idling for a while so as to avoid suspicion. (I was also banking on the fact that someone else’s program might’ve run before mine, leaving the file lying around somewhere for me to grab.)
Of course, I couldn’t make this look too obvious. I mean, imagine submitting a tiny C++ program claiming it could compress a 10 MB file into just a few kilobytes. He’d immediately know it was fake.
So, I padded the compressed binary, adding a bunch of random hexadecimal and binary junk to bulk it up to around 5 MB. It was still phenomenal—better than any other method on the market for this type of file—but at least it seemed somewhat plausible. I mean, maybe I’d "discovered" some revolutionary new compression algorithm or something and smashed the weissman score.
The contest went on for nearly a month, and the guy had a leaderboard tracking scores. Most people were hovering just slightly above standard compression rates—around 9 MB. Nothing groundbreaking, though.
So, it’s a Friday night. I finish the program, pack it all up, and send it off. Then I go to bed. Needless to say, I woke up with a quite an excited response from my professor.
Don’t offer up a game you aren’t willing to lose. This should have had a mathematical definition and required a formal proof if Mike wanted to play it this way. By allowing for word play he made this a game of interpretation and not a true mathematical game.
I would not have accepted multiple files nor scripting. Scripts offers up the entire contents of the linux operating system and this could allow you to use operating system files as partial data. To really test compression/decompression the rules need to be a lot tighter. The challenge was met and beat because the criteria was bad. It leaves open a number of dishonest approaches.
The decompression program needs to be constrained to itself and the singular file to decompress and can not utilise anything other than basic operating system functions for the purpose of reading and writing the constrained files. Any bytes the program accesses otherwise can be used as part of the total count.
It's very hard to constrain this without some form of chest being possiblemespecially if you have some years to set it up.
> Scripts offers up the entire contents of the linux operating system and this could allow you to use operating system files as partial data.
I also thought about this, but when thinking more about the argument by the challenge creator, I understood that the data in the OS doesn’t help you.
If you’re given a file with 100 bits, there are 2^100 possibilities. If you want to compress to 99 bits, there are only 2^99 possible compressed outputs. So when generating and trying to compress every 100 bit file, you’re not going to be able to compress half of them. The external data doesn’t help you unless you get to modify it after receiving the file to compress. If you can get patched merged into the Linux kernel, you could use the data, but if the data is fixed, you can’t reference the data with less data than the original file (for some cases atleast).
Edit: actually, when thinking about it: you could actually use the OS as another few bits of out of band information you get for free, which makes it possible again. If you have 8 different OS to choose from, you could use their differences to compress any file. At least theoretically, the 3 bits you gain practically probably aren’t enough to make a difference.
One possibility I considered is that your script could use an "apt install" to provide out of band data, you could put the information you need into the operating system from the universe repositories/your own package server.
suppose the data were generated by a csprng with 256 bits of state. then in fact there are just slightly more than 256 bits of entropy there (prng state plus the length of the generated file). only, you'd have a snail's chance in hell of actually finding that seed
If he really wanted to make sure he kept the money, he could just use a hardware RNG. I don’t know how common they were in 2001 but you could probably even generate some practically unpredictable random numbers using radio noise.
> With this offer, you can tune your algorithm to my data.
One can generate a 1GB file or 10GB file. It is highly likely that there is some form of a repeatable pattern in there to shave off 50-100 bytes by sliding window search.
Then the decompressor is essentially - at this index, expand this pattern. The compressed file excludes the range of pattern.
One may not always get such a pattern, but on multiple tries it’s very feasible. Payout just needs one win in 50 tries.
You could generate a 100GB file. The bigger the file the higher the chance of repeating pattern.
The challenge is won if compressed_file + decompressor is less one byte than original file.
One could have a self executing decompressor to save some file overhead bytes.
Your idea fails pretty easily. Simply do the math, and you’ll find your idea won’t work.
One argument: if your method worked, then we could compress nearly any file. This violates the pigeonhole principle.
Another example: for a 4GB file you’d need roughly a 4 byte integer to specify where the repeat started and where the second was, for 8 bytes. Then you need a length byte. Now you need a repeated 10+ bytes sequence to make this compress anything. There are 256^10=2^80 possible such 10 bytes sequence sequences, and only ~2^32 of them in a 4 gb file. So the odds of a repeated are around 1 in 2^50.
I though he would try to net profit through probability by requesting to regenerate the file and hope to get a compressible one in at least 5000/100=50 times. (Though I don't think that would work.)
The crux of the ongoing debate over the post has to do with the ambiguity of the stated requirements.
For there to be any rational discussion, you need to have proper unique definitions for what is being communicated. Without a definition (identity) no system of truth seeking will provide a definitive truth.
Most notably, "compression" has many different meanings currently in practical use. For example many people use the word compression and simply mean that the uncompressed size is more than the compressed size; lossy vs lossless doesn't matter when its good enough, all that matters is its been reduced, and an example of that might be H264 video.
Other more rigorous individuals will use it to mean entropy encoding, which has a strict definition, part of which is lossless, and in these cases they often view Shannon's theorems as gospel, and at that point they often stop thinking and may not try to sharpen their teeth on a seemingly impossible task.
These theorems are often very nuanced, with the bounds very theoretical or still undiscovered, though still seemingly correct. When someone is told something is impossible, they don't fully apply themselves to try to find a solution. Its a worn path that leads nowhere, or so they imagine.
This mindset of learned helplessness prevents a nuanced intuitive understanding of such problems and any advances stagnate. People take the impossibility at face value, and overgeneralize it (without that nuanced understanding), this is fallacy and it can be very difficult to recognize it when there is no simple way to refute it on its face.
Here is something to consider. Shannon's source coding theorem relies on the average uncertainty of probabilities as a measure of entropy (information). Some exploration is needed to understand what makes this true, or where it fails. It has been my experience that only a very few exceptional people ever attempt this process.
The main question that jumps out in my mind since compression has been a thing I've been interested in for years; is, can an average entropy in statistics tell us accurately whether this is true in systems with probabilities and distributions that are a mix of mathematical chaos but also contain regularity within their domains? Statistics can be unwieldy beasts, and act as regular sources of misunderstanding and falsehood without a very rigorous approach.
For a similar problem, people may find this article particularly relevant with regards to the chaotic 3-body problem in astrophysics.
> It's not my fault that a file system uses up more space storing the same amount of data in two files rather than a single file.
It's not interesting to see people play word games with each other in the HN comments, so why is it interesting to see this kind of disingenuous behavior? Trolls have always existed. I don't get what's interesting about it.
The act of offering a cash prize to prove you wrong kind of makes you look like a jerk. It's assumed that you will use every loop hole at your disposal. Seeing the situation reversed so elegantly makes me root for the challenger.
As Patrick said, if he had gotten a 1000-byte file and sent a 999-byte file/decompressor file back, nobody would have counted the inode data as a part of the solution. Why count it now?
Because the solution specifically was for 1 file submitted, 2 files back (compressed and decompressor). This topic ise the next level trolling that went on. The summary and ng faw, explained why these file metagames aren't useful. They obscure the lack of a compression strategy (eg if the inode equivalent was 2ft of tape) by nature of dispersing information into the fs.
> Because the solution specifically was for 1 file submitted, 2 files back (compressed and decompressor).
Patrick:
I meant can I send you a decompressor and several compressed files whose total file size is less than the original uncompressed file and from which I can regenerate the original uncompressed file
I do agree that this last sentence (the one you quoted) is very unfortunate and doesn't match the wit of the rest. A better closing remark might have been something on the lines of, “I never claimed to have performed any compression. I only claimed to have met the terms of the challenge. I asked if I can send multiple files and the response was ‘sure’. I provided the files; they meet the total file size criterion; and the script correctly reconstructs the original files.”
Take random bytes from /dev/urandom, encrypt it. There is no way any compressor can reduce the size of that file. I'm pretty sure there is no way the second player can win this game. Actually, the encryption part isn't even necessary. There is no compressor that can reduce the size of a random stream of bytes.
There is no general purpose compressor that can achieve compression on all random streams of bytes... but there is a possibility to compress a specific random stream of bytes, if you get lucky
You'd have to define what you mean by luck which would be equivalent to choosing some pattern that your compressor can recognize in a random stream of bytes and then computing the probability that pattern appeared in some random byte sequence. It's not a winnable game in any practical sense because whatever compressor you choose the probability of getting a sequence from your opponent that will conform to the patterns recognizable by your compressor is essentially 0.
The encryption part might actually be harmful: for (silly) example, if you encrypt using the Bacon cipher, which turns every letter into five letters all of which are either "a" or "b". :) You'd need to pick an encryption method that doesn't expand the text at all, not even via a header or footer block. Better to stick with bytes generated uniformly at random, in which case you are correct.
But you don’t need to compress every possible file to make playing such a game a good idea.
Suppose if you shave 1 bit off of a sample bit stream you win and you lose if that fails. Your compressor looks at the first 2 bits and if it starts 11 it starts with 1 and otherwise it starts 0 then the bit stream. Now you have a 1 in 4 chance to compress a random string by 1 bit and. A 3 in 4 chance of making it longer by 1 bit.
It’s ultimately not about compression but what odds are worth it.
Mike supports $SPORT team the Compressors. He's so sure they are unbeatable that he accepts 50:1 bets against them. Patrick bets 100$ that they won't win this year's championship; Mike accepts. Later, the Compressors announce financial troubles and can't pay the fee to enter the championship, which is then won by another team. Patrick reclaims his 5000$. Mike refuses to pay saying that the Compressors have not been beaten, and the championship result does not disprove his claim that the Compressors are unbeatable, which was the spirit of the bet since the beginning. Mike offers to refund the 100$. Patrick holds that the terms of the bet were met and he deserves the 5000$.
>$SPORT team the Compressors
They were lossless last season. :-D
The original email thread was from 2001, and it gets posted to HN periodically:
https://news.ycombinator.com/from?site=patrickcraig.co.uk
For another compression challenge that is still ongoing, try "500000€ Prize for Compressing Human Knowledge" (also known as "Hutter Prize"):
http://prize.hutter1.net/
https://news.ycombinator.com/item?id=37502329 - Hutter Prize for compressing human knowledge (2023-09-13, 215 comments)
I have a fundamental problem with the Hutter prize stating that intelligence is related to compression & then sponsoring a prize for lossless compression. Intelligence is related to lossy compression. Lossless is mainly a mechanistic act.
Marcus Hutter is free to sponsor a prize according to his definition of intelligence. You are free to sponsor a prize according to your intelligence definition involving a lossy compression.
Of course he's "free to".
Are you trying to imply that people need to put up big sums of money when they want to critique someone else's definitions be taken seriously? If yes, I think that's ridiculous. If no, I can't figure out the point of your comment.
Say we have some dataset composed of D bytes.
Next, say I find some predictive model of the data M, where M is composed of N bytes. Furthermore, let us say that the entropy of the dataset under the model is H bytes.
Then, if N + H < D, my model compresses the data.
It doesn't matter if the model is deterministic or probabilistic: a probabilistic model can be used to (losslessly) compress a dataset with entropy coding.
One more argument for compression being equivalent to intelligence: Across many fields of statistical machine learning, there are generalization bounds which have the form:
test error <= train error + model complexity
That is, we don't expect to do any worse on new (test) data, than the sum of the train error and the model complexity (smallest compressed size). Notice that if you interpret the train error as the entropy of the data (i.e. error under a cross entropy loss), then the models which satisfy the statistical generalization bound correspond to those which best compress the data. In other words: the model which produces the shortest description of the data is the one which is expected to generalize best.
> It doesn't matter if the model is deterministic or probabilistic: a probabilistic model can be used to (losslessly) compress a dataset with entropy coding.
But if you can choose to lose information you can obviously achieve a higher compression score. That's literally what optical & auditory compression exploits. Indeed, we know people generally don't memorize the entire Wikipedia article. Rather they convert what they learn into some internally consistent story that they can then recite at any time & each time they recite it it's even differently worded (maybe memorizing some facts that help solidify the story).
Again, I have no problem with compression and decompression being equated to intelligence provided both are allowed to be lossy (or at least one facet of intelligence). That's because you get to inject structure into the stored representation that may not otherwise exist in the original data and you get to choose how to hydrate that representation. That's why LZMA isn't "more intelligent" than ZIP - the algorithm itself is "smarter" at compression but you're not getting to AGI by working on a better LZMA.
It's also why H264 and MP3 aren't intelligent either. While compression is lossy decompression is deterministic. That's why we can characterize LLMs as "more intelligent" than LZMA even though LZMA compresses losslessly better than LLMs.
I agree with you in spirit. I just thought you might be interested in some of the technical details regarding the relationship between compression and generalization!
I'll have a paper out next week which makes your point precise, using the language of algorithmic rate--distortion theory (lossy compression applied to algorithmic information theory + neural nets).
I think another way of understanding this is through the "Value Equivalence Principle", which points out that if we are learning a model of our environment, we don't want to model everything in full detail, we only want to model things which affect our value function, which determines how we will act. The value function, in this sense, implies a distortion function that we can define lossy compression relative to.
This seems to assume that there is a tractable way to encode H efficiently, but this seems very difficult given a model that is focused on understanding the content. Ex: I can easily write a program that can do basic arithmetic, but given say a bitmap scan of elementary school math materials, such a program gives me no easy way to compress that; rather something generic like PNG (that does not know or understand the material) will far outperform.
Great point. This points to the related issue: what do we want to compress? Do we want to compress "the answer", here the arithmetic expression's solution, or do we want to compress the image?
You can formalize this with rate--distortion theory, by defining a distortion function that says what your objective is. That implies a well-defined complexity relative to that distortion function.
Okay to be clear, I've written a paper on exactly this topic which will be announced in a week or so. So you won't find anything on the subject yet, haha. But I use almost exactly this example.
I used the above analogy and the following when articulating the magnitude of near-lossless-ness that large LLM's have managed to attain, especially when all of the humanities corpus is compressed into a USB flash drive; the Kolmogorov complexity re-mitted/captured is similar to a master-piece like the Mona Lisa having to be described in X many brush-strokes to achieve Y amount of fidelity.
"If it were a fact, it wouldn't be called intelligence."
I think the other way to read it is that you're fundamentally going to have to choose (by your freedom to control your algorithm's design) which of the maps from (2^|x|) -> (2^|compress(x)|) are the ones that actually end up getting no compression (ie |compress(x)| actually > |x| - bc of pigeonhole principle) and which ones are the ones that do get to be called compressed. So you do have an implied kind of lossiness w.r.t. to what parts of the input domain are actually compressed.
Isn’t the intelligence shown by compressing lossless the scheme you use? Applying the algorithm is the easy part, the proof of intelligence is inventing the algorithm which compresses.
Hutter put up the prize as a way of getting people to approximate his AIXI algorithm, which he already considers to be the pinnacle of artificially intelligence. That's also why lossy compression isn't interesting. It'd be an approximation of an approximation and not particularly useful for hutter's purpose.
> his AIXI algorithm, which he already considers to be the pinnacle of artificially intelligence
This paper from Hutter in 2015 https://arxiv.org/abs/1510.04931 considers AIXI to be "subjective", since the choice of Universal Turing Machine is left unspecified:
> We show that Legg-Hutter intelligence and thus balanced Pareto optimality is entirely subjective, and that every policy is Pareto optimal in the class of all computable environments. This undermines all existing optimality properties for AIXI. While it may still serve as a gold standard for AI, our results imply that AIXI is a relative theory, dependent on the choice of the UTM.
Totally just misread "hutter1" as "hunter2".
Sorry, if you're trying to hustle people by xstging $100 per try, don't catch the sleight of hand in the "multiple files" question, and accept, you were beaten at your own game, fair and square.
I feel like if the FAQ requires not using filename shenanigans then the slight of hand was illegal the whole way.
He didn't use filenames, he used files, and if that were illegal, Mike shouldn't have accepted it.
He does use the filenames. If you change the filenames randomly (such that the files sort differently), it does not work.
Not in any significant way. The decompressor could be changed to require you to feed the files into it in the correct order or expect some other sorting.
What you're saying is like saying that you encoded info in filenames because decompress.sh expects a file "compressed.dat" to exist. It's not describing any meaningful part of the scheme.
The filenames contain information that you need in some way for the scheme to work.
You are combining different parts and inserting a missing byte every time you combine the files. You need to combine the parts in the correct order, and the order is part of the information that makes this work.
If the ordering isn't coming from filenames, it needs to come from somewhere else.
You could do the same spitting trick but only split at progressively increasing file lengths at the character '5'. The "compression" would be worse, so you'd need a larger starting file, but you could still satisfy the requirements this way and be independent of the filenames. The decompressor would just sort the files by increasing length before merging.
That's a neat idea.
> such that the files sort differently
But if you change them without making them sort differently, everything is fine. He depends on the order, not the filenames. You could even remove the filenames entirely, as long as you patch the code to account for such a strange environment.
This guy clearly failed because he didn't actually do any compression, he just ab-used the filesystem to store parts of the data and then tried to argue that metadata was not data...
But FYI someone else actually managed to compress that exact same data: https://jasp.net/tjnn/2018/1.xhtml
He clearly failed at doing any actual compression, but he did not fail at the challenge. He satisfied every single condition.
He obviously knows that. He knows exactly what compression is and that his solution is not it. He showed (successfully) that meeting the challenge as it was worded did not require compression, which the setter of the challenge didn't realize.
If Mike didn't want multiple files, he should have said no to multiple files. The fact that he wanted $100 per try shows that he just got outhustled at his own hustle. He should have been more particular about the rules, and upheld the ones he himself set.
Should have charged $100 a file.
That argument was made, and addressed in the emails. How are you defining compression? Actual compassion doesn't seem to have been a requirement in the challenge.
Is there any more information about this solution?
I'm rather skeptical of this claim that the data was compressed in 2018 because there is no further information, apart from a hash value given.
If it's a true claim they must have identified some "non-random" aspect of the original data, and then they could have given more info.
I didn’t quite get the method used to „compress“ the data from the article, maybe this rephrasing helps someone:
You basically split the file every time you encounter a specific character, and your compressor just combines all files it finds with the character you split by. If you split at every „X“ Char which might occur 1000 times in the file, the compressor only needs a small script which joins all files and puts an „X“ between them, which is less than 1000 bytes. The „trick“ is storing the location of the Xs you removed in the file sizes of the individual files.
My take was that the information is stored in the ordering of the files. The decompressor doesn't care about the file size of each file, right?
Both are needed. If you want to transmit this solution from one computer to another you need to store the size of each file (or insert a fancy delimiter that takes even more space).
I wonder if it would have been possible to win the challenge legitimately?
If a randomly-generated file happened to contain some redundancy through sheer chance, you could hand-craft a compressor to take advantage of it. This compressor would not work in general for random data, but it could work for this one particular case.
It's a bet worth taking, because the payoff, 50:1 ($5,000 to $100), is pretty good. Play the game 50 times and you might get a file you could compress.
The challenge, then, would be for the person offering the bet to generate a really random file that contained no such redundancy. That might not be easy.
I think you can make some argument about why this isn't possible at 50:1 odds.
A plausible "decompressor" is at least, say, 30 or 100 bytes, so the random file needs to have 30 bytes less entropy than you expected, which happens with probability X where X << 1/50. Sum over the whole domain of reasonable decompressors, and you still don't get there.
This argument could do with more rigor, but I think it's correct. Give me 100 million to 1 odds, though, and I'll take my chances trying to brute force a compressor.
This is actually an extremely interesting question. 'Weak' files that are more easily compressable than others certainly exist, but with low probability.
For example, the all-zeros file is a member of the set of all random 3 megabyte files, and it would certainly be possible to compress that, if by great good fortune you were lucky enough to receive it - albeit something that is unlikely to ever happen in the possible lifetime of the universe, given current physical theories.
Is it possible to quantify this idea of a 'weak' file more accurately?
Yes, you can think of it in terms of (WLOG think of any uniquely-decodable code) prefix-free codes. They're uniquely decodable - for things that are not uniquely decodable, that implies that you could put overlapping codes over that symbol. If you make a matrix like this where the rows are the bitstrings of length b and columns are individual bits:
then you have 2^b rows. Suppose you look at the sub-bitstrings of length k, k < b. They all appear the same number of times, if you count them wherever they appear at any position in across the entire matrix.However, you also know, for sure, that, if a prefix-free code appears in a particular row, that means since it's impossible to overlap with anything else in that row at its span. What does that imply? That the prefix-free codes have a greater 'occupancy percentage' of a single row than all other sub-bitstrings. That means that you must find fewer of them, on average, inside of a single row.
But since we know that all sub-bitstrings appear the same number of times throughout the entire matrix, what else can we deduce? That the prefix-free codes must appear /over more rows / on average, if they cannot appear as many times while looking at bit positions /along the columns/. That means they will occur as a sub-pattern in full-bitstrings more often than typical random sub-patterns.
So weakness here corresponds to the presence of patterns (prefix-free codes) that are:
- non-overlapping within bitstrings
- widely distributed across bitstrings
- due to their wide distribution, there's a higher chance of encountering these patterns in any given random file
- therefore, weak files are more compressible because they contain widely-distributed, non-overlapping patterns that compression algorithms can take advantage of
I know very little about this, but a little googling suggests that the measure you're looking for is entropy, which has a mathematical definition: https://en.wikipedia.org/wiki/Entropy_(information_theory)
One thing you can do, as the other commenter pointed out, is consider entropy of the file.
However, this restriction is too much for the purposes of this challenge. We don’t actually need a file with low entropy, in fact I claim that a weak file exists for files with entropy 8 (the maximum entropy value) - epsilon for each epsilon > 0.
What we actually need is a sufficiently large chunk in a file to have low entropy. The largeness is in absolute terms, not relative terms.
A very simple file would be taking a very large file with maximum entropy and adding 200 0’s to the end. This would not decrease the entropy of the file much, but it gives way to a compression algorithm that should be able to save ~100 bytes
Note that if this large chunk occurs in the middle of the file, then you will need extra space to encode that position. For example, a random bit string of length 2^n is decently likely to have a run of n zeroes. But this doesn’t help you because you need n bits just to encode where that run happens.
But storing an index for a file of length 2^n takes only n bits, so you need that run of 0’s to be of length n+1 to win
What if we didn't even attempt to compress the file? Theoretically, there is a random number generator + seed combination that would output the same bytes as the source file.
Neat idea but chances are the length of the seed is equal to the length of the original file.
There is a polynomial expression that will match the file. I wonder if expressing it would be compressible to a lower size than original file? [edit] I’m wrong - not all sequences can be expressed with a polynomial.
This reminds me of a data compression scheme I came up with once:
Treat an n bit file as a polynomial over the finite field with characteristic 2. Now, there are some irreducible polynomials in this field, but many polynomials have factors of x and of (x+1). Factor the polynomial into P(x)x^n (x+1)^m. and just collect these terms, storing only P, n, and m.
That's not how seeds work. Seeds are tiny.
Actually this would work perfectly if you knew it was generated in a single pass by a known random number generator and you had tons of time to brute force it.
If the file were generated by a natural source of entropy then forget it.
Or even if modified in a trivial way like adding 1 to every byte.
What is with the many downvotes but no comments? Everything I said is factual.
Seeds are something like 32 bits, though it depends on the exact implementation. But not the length of files.
Somewhere (discussed on HN) someone devised a "better-than-perfect" compressor. Most inputs get compressed (smaller than input), except for one input that does not. That one input is cryptographically impossible to find - or something along those lines.
Unfortunately I can't find the article I'm describing here, maybe someone else can? It was a long time ago so I might be misrepresenting it slightly.
That's how all compressors work, in that likely files (eg. ASCII, obvious patterns, etc) become smaller and unlikely files become bigger.
> likely files (eg. ASCII, obvious patterns, etc) become smaller
Likely files for a real human workload are like that, but if "most inputs" is talking about the set of all possible files, then that's a whole different ball game and "most inputs" will compress very badly.
> unlikely files become bigger
Yes, but when compressors can't find useful patterns they generally only increase size by a small fraction of a percent. There aren't files that get significantly bigger.
In some cases it can be certain, the ascii encoded in the usual 8 bits has fat to trim even if it is random in that space.
You could design the opposite, a system where ultra-rare files get compressed a lot but most don't compress.
But as stated, what you said is not possible. To compress by a single byte, only 1/256 files can be compressible, or you'd have multiple different inputs compressing down to the same output.
There's another interesting loophole to these general compression challenges. A universal Turing machine will necessarily compress some number of strings (despite the fact that almost all strings are incompressible). The set of compressible strings varies depending on the choice of UTM, and if your UTM if fixed, you're out of luck for random data. But if the UTM is unspecified, then there exist an infinite number of UTMs that will compress any specific string.
To some extent, this resembles the approach of "hiding the data in the decompressor". But the key difference here is that you can make it less obvious by selecting a particular programming language capable of universal computation, and it is the choice of language that encodes the missing data. For example, suppose we have ~17k programming languages to choose from—the language selection itself encodes about 14 bits (log2(17000)) of information.
If there are m bits of truly random data to compress and n choices of programming languages capable of universal computation, then as n/m approaches infinity, the probability of at least one language being capable of compressing an arbitrary string approaches 1. This ratio is likely unrealistically large for any amount of random data more than a few bits in length.
There's also the additional caveat that we're assuming the set of compressible strings is algorithmically independent for each choice of UTM. This certainly isn't the case. The invariance theorem states that ∀x|K_U(x) - K_V(x)| < c for UTMs U and V, where K is Kolmogorov complexity and c is a constant that depends only on U and V. So in our case, c is effectively the size of the largest minimal transpiler between any two programming languages that we have to choose from, and I'd imagine c is quite small.
Oh, and this all requires computing the Kolmogorov complexity of a string of random data. Which we can't, because that's uncomputable.
Nevertheless it's an interesting thought experiment. I'm curious what the smallest value of m is such that we could realistically compress a random string of length 2^m given the available programming languages out there. Unfortunately, I imagine it's probably like 6 bits, and no one is going to give you an award for compressing 6 bits of random data.
The problem with the UTM argument designed for a specific string is that the UTM size grows without bound. You also don’t need a UTM, which will be much larger than a simple TM or finite automaton. And now we’re back to the original problem: designing a machine capable of compressing the specific string and with smaller description.
UTM provides no gain.
The issue with the invariance theorem you point out always bugged me.
Let s be an algorithmically random string relative to UTM A. Is it the case that there exists some pathological UTM S, such that K(s|S) (the Kolmogorov complexity of s relative to S) is arbitrarily small? I.e. the blank print statement of S produces s. And there always exists such an S for any s?
Is there some way of defining a meta-complexity measure, the complexity of some UTM without a reference UTM? It seems intuitive that although some pathological UTM might exist that can "compress" whichever string you have, its construction appears very unnatural. Is there some way of formalizing this "naturalness"?
I would have expected it to be possible to compress a single arbitrary random file with a small program. I would have thought an RNG could generate a file with some weakness that allows you to compress it, although said compressor would be worse at other inputs.
Likewise this challenge would have been stronger if the requirement was to compress two provided arbitrary files :P
I thought the same thing. Surely a random binary sequence will have some (small) repeating sequences? So as long as we can find them and (efficiently) mark their locations, we can have some small size reduction.
Which is basically what he’s done here. He’s cheating by marking the location of the repeating sequence using unique files, rather than some other actually more efficient location system.
This is a fun discussion! I think it helps to make compression feel more approachable for armchair enthusiasts.
> I think that I did not make a mistake except to believe that you truly intended to attempt to compress the data I was sending you.
Weird phrase to hear from a digital carnival scam artist.
This is hilarious, but Mike’s behavior boils my blood. To switch the rules when fairly beaten is just so scummy!
Of course Patrick’s solution occurred to me immediately once he started talking about splitting up the file. :)
Then I got to wondering if it would be possible to compress a large enough file and save a few bytes, in the spirit of what Mike actually wanted. For instance, if you tried encrypting with a bunch of random keys, you’ll find some such that the result ends with n 0s (say), so then your result could be something like encrypted + key + n. Typically you’d expect to find that the length of the encryption key would be roughly the length of the number of 0s, so you wouldn’t actually net any savings. But I think if you searched long enough, you might “get lucky” and find a key that generated more 0s then the length of the key, so you could theoretically win the challenge. Of course you’d have to somehow embed a program which could decrypt and reassemble everything in a tiny amount of space, but it still doesn’t seem impossible like Mike seems to suggest. Perhaps just requires an intractable amount of computing power. :)
I’m not sure if any of that made sense. I think I’m using the wrong terms for things.
I remember reading from a previous thread how yes you can do something like that (or lots of other techniques) but they will only work on a small percentage of files.
That sometimes the smallest decompressor (including the key, in your example) will result in overall less space, but usually it will result in more (because the smallest key that generates 10 zeros happens to be 14 digits long, etc.).
And so ultimately the more interesting question becomes about those probabilities, and therefore what is the right entry price for the bet to be even?
I particularly like this:
> It's not my fault that a file system uses up more space storing the same amount of data in two files rather than a single file.
Related. Others?
The $5000 Compression Challenge - https://news.ycombinator.com/item?id=9163782 - March 2015 (168 comments)
The $5000 Compression Challenge - https://news.ycombinator.com/item?id=5025211 - Jan 2013 (61 comments)
The $5000 Compression Challenge - https://news.ycombinator.com/item?id=4616704 - Oct 2012 (119 comments)
Story time. In our Data Structures and Algorithms class during my undergrad, we had a similar challenge posed to us by an associate professor who was quite proud of the fact that it had never been beaten.
The sophistication of my compression understanding at the time extended to lempel-ziv and huffman and well... that's about it. So I was like, “Okay, I’m not going to create the world’s best compression method on a whim.” So I started thinking: if I can’t compress the file, maybe I can just fetch it from somewhere else.
The first thing I tried was writing a program that searches online by uploading the file onto a public FTP. From there, I could fetch it. Simple enough, right?
Well, here’s the issue with that: there’s no guarantee I’d even have internet connectivity. That’s probably the first thing he’d block.
So, he gives us the file. I think it was something like 10 MB, and when I looked through it, the entropy was high. Like, really high. Standard compression tools like 7Z or rar wanted NOTHING to do with it.
Then I started thinking again. Maybe this guy’s running all the submissions back-to-back. He’s probably processing them in sequence because he’s got tons of entries coming in. So I thought: if I can’t download the file, maybe there's a chance it's already on the machine from somebody's previous simulation - I can find it somewhere on the hard drive. My program might not be a great decompression system, but you know what it is? A fantastic file search.
So, I threw together a program that tree searched the entire disk with a simple checksum to verify I’d found the exact right file. If it matched, I’d just copy the file into the current directory and say, “I’m done.” - though if it found the file too quickly it would continue idling for a while so as to avoid suspicion. (I was also banking on the fact that someone else’s program might’ve run before mine, leaving the file lying around somewhere for me to grab.)
Of course, I couldn’t make this look too obvious. I mean, imagine submitting a tiny C++ program claiming it could compress a 10 MB file into just a few kilobytes. He’d immediately know it was fake.
So, I padded the compressed binary, adding a bunch of random hexadecimal and binary junk to bulk it up to around 5 MB. It was still phenomenal—better than any other method on the market for this type of file—but at least it seemed somewhat plausible. I mean, maybe I’d "discovered" some revolutionary new compression algorithm or something and smashed the weissman score.
The contest went on for nearly a month, and the guy had a leaderboard tracking scores. Most people were hovering just slightly above standard compression rates—around 9 MB. Nothing groundbreaking, though.
So, it’s a Friday night. I finish the program, pack it all up, and send it off. Then I go to bed. Needless to say, I woke up with a quite an excited response from my professor.
Don’t offer up a game you aren’t willing to lose. This should have had a mathematical definition and required a formal proof if Mike wanted to play it this way. By allowing for word play he made this a game of interpretation and not a true mathematical game.
I would not have accepted multiple files nor scripting. Scripts offers up the entire contents of the linux operating system and this could allow you to use operating system files as partial data. To really test compression/decompression the rules need to be a lot tighter. The challenge was met and beat because the criteria was bad. It leaves open a number of dishonest approaches.
The decompression program needs to be constrained to itself and the singular file to decompress and can not utilise anything other than basic operating system functions for the purpose of reading and writing the constrained files. Any bytes the program accesses otherwise can be used as part of the total count.
It's very hard to constrain this without some form of chest being possiblemespecially if you have some years to set it up.
> Scripts offers up the entire contents of the linux operating system and this could allow you to use operating system files as partial data.
I also thought about this, but when thinking more about the argument by the challenge creator, I understood that the data in the OS doesn’t help you.
If you’re given a file with 100 bits, there are 2^100 possibilities. If you want to compress to 99 bits, there are only 2^99 possible compressed outputs. So when generating and trying to compress every 100 bit file, you’re not going to be able to compress half of them. The external data doesn’t help you unless you get to modify it after receiving the file to compress. If you can get patched merged into the Linux kernel, you could use the data, but if the data is fixed, you can’t reference the data with less data than the original file (for some cases atleast).
Edit: actually, when thinking about it: you could actually use the OS as another few bits of out of band information you get for free, which makes it possible again. If you have 8 different OS to choose from, you could use their differences to compress any file. At least theoretically, the 3 bits you gain practically probably aren’t enough to make a difference.
One possibility I considered is that your script could use an "apt install" to provide out of band data, you could put the information you need into the operating system from the universe repositories/your own package server.
The computer used for the test probably doesn’t have network connectivity, otherwise you could just download the data with curl.
My decompressor:
curl ftp://path/to/original.dat
suppose the data were generated by a csprng with 256 bits of state. then in fact there are just slightly more than 256 bits of entropy there (prng state plus the length of the generated file). only, you'd have a snail's chance in hell of actually finding that seed
If he really wanted to make sure he kept the money, he could just use a hardware RNG. I don’t know how common they were in 2001 but you could probably even generate some practically unpredictable random numbers using radio noise.
Mike thanked random.org for the data. It's exactly what they are doing.
I’d take this bet.
> With this offer, you can tune your algorithm to my data.
One can generate a 1GB file or 10GB file. It is highly likely that there is some form of a repeatable pattern in there to shave off 50-100 bytes by sliding window search.
Then the decompressor is essentially - at this index, expand this pattern. The compressed file excludes the range of pattern.
One may not always get such a pattern, but on multiple tries it’s very feasible. Payout just needs one win in 50 tries.
You could generate a 100GB file. The bigger the file the higher the chance of repeating pattern.
The challenge is won if compressed_file + decompressor is less one byte than original file.
One could have a self executing decompressor to save some file overhead bytes.
Your idea fails pretty easily. Simply do the math, and you’ll find your idea won’t work.
One argument: if your method worked, then we could compress nearly any file. This violates the pigeonhole principle.
Another example: for a 4GB file you’d need roughly a 4 byte integer to specify where the repeat started and where the second was, for 8 bytes. Then you need a length byte. Now you need a repeated 10+ bytes sequence to make this compress anything. There are 256^10=2^80 possible such 10 bytes sequence sequences, and only ~2^32 of them in a 4 gb file. So the odds of a repeated are around 1 in 2^50.
Tweak methods and estimate as you wish. It fails.
As a general principle, absolutely.
In practice, I wonder what size of file we're talking about that would result in net compression on random data 50% of the time?
I have no intuition whether it's 1 GB, 1 TB, 1 PB, or beyond.
Nope. As your file gets longer so does the data needed to specify where the repeats are. See my other estimate in this thread. It fails spectacularly.
good compression = pattern eventually
maximum compression = indelineable from random data
I though he would try to net profit through probability by requesting to regenerate the file and hope to get a compressible one in at least 5000/100=50 times. (Though I don't think that would work.)
This gets reposted quite often.
The crux of the ongoing debate over the post has to do with the ambiguity of the stated requirements.
For there to be any rational discussion, you need to have proper unique definitions for what is being communicated. Without a definition (identity) no system of truth seeking will provide a definitive truth.
Most notably, "compression" has many different meanings currently in practical use. For example many people use the word compression and simply mean that the uncompressed size is more than the compressed size; lossy vs lossless doesn't matter when its good enough, all that matters is its been reduced, and an example of that might be H264 video.
Other more rigorous individuals will use it to mean entropy encoding, which has a strict definition, part of which is lossless, and in these cases they often view Shannon's theorems as gospel, and at that point they often stop thinking and may not try to sharpen their teeth on a seemingly impossible task.
These theorems are often very nuanced, with the bounds very theoretical or still undiscovered, though still seemingly correct. When someone is told something is impossible, they don't fully apply themselves to try to find a solution. Its a worn path that leads nowhere, or so they imagine.
This mindset of learned helplessness prevents a nuanced intuitive understanding of such problems and any advances stagnate. People take the impossibility at face value, and overgeneralize it (without that nuanced understanding), this is fallacy and it can be very difficult to recognize it when there is no simple way to refute it on its face.
Here is something to consider. Shannon's source coding theorem relies on the average uncertainty of probabilities as a measure of entropy (information). Some exploration is needed to understand what makes this true, or where it fails. It has been my experience that only a very few exceptional people ever attempt this process.
The main question that jumps out in my mind since compression has been a thing I've been interested in for years; is, can an average entropy in statistics tell us accurately whether this is true in systems with probabilities and distributions that are a mix of mathematical chaos but also contain regularity within their domains? Statistics can be unwieldy beasts, and act as regular sources of misunderstanding and falsehood without a very rigorous approach.
For a similar problem, people may find this article particularly relevant with regards to the chaotic 3-body problem in astrophysics.
https://www.aanda.org/articles/aa/full_html/2024/09/aa49862-...
It’s amazing that this commentary on the value of prediction markets was written in 2001.
> It's not my fault that a file system uses up more space storing the same amount of data in two files rather than a single file.
It's not interesting to see people play word games with each other in the HN comments, so why is it interesting to see this kind of disingenuous behavior? Trolls have always existed. I don't get what's interesting about it.
The act of offering a cash prize to prove you wrong kind of makes you look like a jerk. It's assumed that you will use every loop hole at your disposal. Seeing the situation reversed so elegantly makes me root for the challenger.
As Patrick said, if he had gotten a 1000-byte file and sent a 999-byte file/decompressor file back, nobody would have counted the inode data as a part of the solution. Why count it now?
Because the solution specifically was for 1 file submitted, 2 files back (compressed and decompressor). This topic ise the next level trolling that went on. The summary and ng faw, explained why these file metagames aren't useful. They obscure the lack of a compression strategy (eg if the inode equivalent was 2ft of tape) by nature of dispersing information into the fs.
> Because the solution specifically was for 1 file submitted, 2 files back (compressed and decompressor).
Patrick:
I meant can I send you a decompressor and several compressed files whose total file size is less than the original uncompressed file and from which I can regenerate the original uncompressed file
Mike:
Sure.
Who's the bigger troll?
Putting up the initial challenge, especially with the accompanying commentary, is just as much trolling as that.
I do agree that this last sentence (the one you quoted) is very unfortunate and doesn't match the wit of the rest. A better closing remark might have been something on the lines of, “I never claimed to have performed any compression. I only claimed to have met the terms of the challenge. I asked if I can send multiple files and the response was ‘sure’. I provided the files; they meet the total file size criterion; and the script correctly reconstructs the original files.”
this sounds like a Nigerian Prince scammer set-up
Take random bytes from /dev/urandom, encrypt it. There is no way any compressor can reduce the size of that file. I'm pretty sure there is no way the second player can win this game. Actually, the encryption part isn't even necessary. There is no compressor that can reduce the size of a random stream of bytes.
There is no general purpose compressor that can achieve compression on all random streams of bytes... but there is a possibility to compress a specific random stream of bytes, if you get lucky
Absolutely! Can this probability be quantified?
You'd have to define what you mean by luck which would be equivalent to choosing some pattern that your compressor can recognize in a random stream of bytes and then computing the probability that pattern appeared in some random byte sequence. It's not a winnable game in any practical sense because whatever compressor you choose the probability of getting a sequence from your opponent that will conform to the patterns recognizable by your compressor is essentially 0.
The encryption part might actually be harmful: for (silly) example, if you encrypt using the Bacon cipher, which turns every letter into five letters all of which are either "a" or "b". :) You'd need to pick an encryption method that doesn't expand the text at all, not even via a header or footer block. Better to stick with bytes generated uniformly at random, in which case you are correct.
[1] - https://en.wikipedia.org/wiki/Bacon%27s_cipher
But you don’t need to compress every possible file to make playing such a game a good idea.
Suppose if you shave 1 bit off of a sample bit stream you win and you lose if that fails. Your compressor looks at the first 2 bits and if it starts 11 it starts with 1 and otherwise it starts 0 then the bit stream. Now you have a 1 in 4 chance to compress a random string by 1 bit and. A 3 in 4 chance of making it longer by 1 bit.
It’s ultimately not about compression but what odds are worth it.
I read „decompressor and compressed file”, not „fileS”. There is only one person correct here
You should read the rest of the thread.