Two and a half issues:
1) The handling of dynamic blocks leaves something to be desired. The parameters are left mostly undecoded. It'd be really neat if the Huffman symbols were listed somewhere, rather than just being left implicit.
2) The visualization falls apart pretty badly for texts consisting of more than one block (which tends to happen around 32 KB) - symbols are still decoded, but references all show up blank.
Large inputs make the page hang for a bit, but that's probably pretty hard to avoid.
And as an enhancement: it'd be really cool if clicking on backreferences would jump to the text being referenced.
Exactly, it misses out on explaining how the fixed Huffman table is interpreted to produce symbol and distance codes, or how dynamic tables are derived from the input itself. Sure, it's the hardest part, but it's also the most interesting to visualize. As another commenter pointed out, we are just left with mysterious bit sequences for these codes.
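For context, the fixed table isn't derived from the input at all: RFC 1951 (section 3.2.6) simply prescribes the code lengths, so a decoder can build it without reading a single bit of the stream. Roughly, as a Python sketch (not this tool's code):

    # Fixed Huffman code lengths per RFC 1951, section 3.2.6.
    # They never appear in the stream; both sides just know them.
    fixed_litlen_lengths = (
        [8] * 144 +  # literal bytes 0-143
        [9] * 112 +  # literal bytes 144-255
        [7] * 24 +   # end-of-block (256) and length codes 257-279
        [8] * 8      # length codes 280-287
    )
    fixed_dist_lengths = [5] * 30  # distance codes 0-29, all 5 bits

The dynamic case replaces these hard-coded lengths with lengths read from the block header, which is exactly the part that's hard to visualize.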
It would be cool if we could supply our own Huffman table and see how that affects the stream itself. We might want to put our text right there! https://github.com/nevesnunes/deflate-frolicking?tab=readme-...
I think this is something that makes a decent teaching aid but doesn't work well for the uninitiated.
You need someone to spell out exactly what each of the sections are and what they are doing.
As someone who's never really read that much on compression stuff, I have absolutely zero clue what this visualisation is actually showing me.
That's compounded by the lack of legend. What do the different shades of blue and purple tell me? What is Orange?
E.g. on a given text, an orange block shows something like x4<-135. The x4 seems to indicate that the first 4 binary values for the block are important, but I can't figure out what that 135 is referencing (I assume it's some pointer to a value?)
It is a backreference, the main way of dealing with full or partial repetitions in the LZ77 algorithm. It literally means: copy 4 characters from the backward offset of 135. Note that this "backward offset" can overlap previously repeated characters, so x10<-1 equally means: copy the last character 10 times.
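A rough sketch of what the decoder does with one of those tokens (plain Python, nothing specific to this site):

    def apply_backref(out: bytearray, length: int, distance: int) -> None:
        """Copy `length` bytes starting `distance` bytes back in the output.

        Copying byte by byte is what makes the overlap case work: with
        distance=1, the last byte simply gets repeated `length` times."""
        start = len(out) - distance
        for i in range(length):
            out.append(out[start + i])

    buf = bytearray(b"ya")
    apply_backref(buf, 10, 1)  # "x10 <- 1": repeat the last character 10 times
    print(buf.decode())        # 'y' followed by eleven 'a's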
Using this example paragraph, at compression level 1 or higher (copy it including the quotation marks):
“It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of light, it was the season of darkness, it was the spring of hope, it was the winter of despair.”
The red bit at the beginning is zlib header information and parameters. This basically tells the decoder the format of the data coming up, what window size was used, etc.
The following grey section is the Huffman coding tables - more common characters in the input are encoded in fewer bits. This is what later tells the decoder that 000 means 'e' and 1110110 means 'I'.
Getting into the content now - this is where the decoder can start emitting the uncompressed text. The first 3 purple symbols are the UTF-8 bytes of the fancy opening quote - because they're rare in this text, they're each encoded as 6 or 7 bits. Because they take a lot of bits, the website shows them in purple and makes them physically wider. The nearby 't' is encoded in 4 bits, 0110, and is shown in a bluer color.
The orange bits you've mentioned are back references - "x10 <- 26" here means "go back 26 characters in what you've decoded, and copy 10 characters from there." In this way, we can represent "t was the " in only 12 bits, because we've seen it previously.
The grey at the end is a special "end of stream" marker, followed by a red checksum which allows decoders to make sure there wasn't any corruption in the input.
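If you want to poke at that framing outside the visualizer, Python's bundled zlib is enough to see the 2-byte header and the trailing Adler-32 (a small sketch; the exact header bytes depend on the compression level and window size used):

    import zlib

    data = "It was the best of times, it was the worst of times...".encode()
    stream = zlib.compress(data, 6)

    header, deflate_body, trailer = stream[:2], stream[2:-4], stream[-4:]
    print(header.hex())  # '789c' here: compression method + window size, then flags
    print(int.from_bytes(trailer, "big") == zlib.adler32(data))  # True: checksum of the uncompressed data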
I think that's everything. Further reading: https://en.wikipedia.org/wiki/Zlib https://en.wikipedia.org/wiki/Deflate https://en.wikipedia.org/wiki/Huffman_coding
Something must be in the air. I've been working on a gzip/deflate visualizer recently as well: https://jonjohnsonjr.github.io/deflate/
This is very work in progress, but for folks looking for a deeper explanation of how dynamic blocks are encoded, this is my attempt to visualize them.
(This all happens locally with way too much wasm, so attempting to upload a large gzip file will likely crash the tab.)
tl;dr for btype 2 (dynamic Huffman) blocks:
A 3-bit block header.
Three values telling you how many extra symbols (above the minimum number) are in each tree: HLIT, HDIST, and HCLEN.
First, we read (HCLEN + 4) * 3 bits.
These are the bit counts for symbols 0-18 in the code length tree, which gives you the bit patterns for a little mini-language used to compactly encode the literal/length and distance trees. 0-15 are literal bit lengths (0 meaning that symbol is omitted). 16 repeats the previous symbol 3-6 times. 17 and 18 encode short (3-10) and long (11-138) runs of zeroes, which is useful for encoding blocks with sparse alphabets.
These bit counts are in a seemingly strange order that tries to push less-likely bit counts towards the end of the list so it can be truncated.
Knowing all the bit lengths for values in this alphabet allows you to reconstruct a Huffman tree (thanks to canonical Huffman codes; see the sketch after these steps) and decode the bit patterns for these code length codes.
That's followed by a bitstream that you decode to get the bit counts for the literal/length and distance trees. HLIT and HDIST (from earlier) tell you how many of these to expect.
Again, you can reconstruct these trees using just the bit lengths thanks to canonical Huffman codes, which gives you the bit patterns for the data bitstream.
Then you just decode the rest of the bitstream (using LZSS) until you hit symbol 256, the end of block (EOB).
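Since a couple of those steps hinge on rebuilding a tree from nothing but bit lengths, here's a minimal sketch of that piece (the order constant and the worked example are straight from RFC 1951; the Python itself is just illustrative, not this visualizer's code):

    # Order in which the code length code's bit counts are stored (RFC 1951, 3.2.7).
    # Rarely-used entries sit at the end so HCLEN can truncate the list.
    CODE_LENGTH_ORDER = [16, 17, 18, 0, 8, 7, 9, 6, 10, 5,
                         11, 4, 12, 3, 13, 2, 14, 1, 15]

    def canonical_codes(lengths):
        """Turn a list of bit lengths into canonical Huffman codes
        (the algorithm from RFC 1951, 3.2.2). Length 0 = symbol unused."""
        max_len = max(lengths)
        # Count how many codes there are of each bit length.
        bl_count = [0] * (max_len + 1)
        for l in lengths:
            if l:
                bl_count[l] += 1
        # Compute the smallest code value for each bit length.
        next_code = [0] * (max_len + 1)
        code = 0
        for bits in range(1, max_len + 1):
            code = (code + bl_count[bits - 1]) << 1
            next_code[bits] = code
        # Hand codes out to symbols in increasing symbol order.
        codes = {}
        for symbol, l in enumerate(lengths):
            if l:
                codes[symbol] = format(next_code[l], "0{}b".format(l))
                next_code[l] += 1
        return codes

    # Worked example from RFC 1951: symbols A..H with lengths (3, 3, 3, 3, 3, 2, 4, 4)
    print(canonical_codes([3, 3, 3, 3, 3, 2, 4, 4]))
    # {0: '010', 1: '011', 2: '100', 3: '101', 4: '110', 5: '00', 6: '1110', 7: '1111'}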
If you're not already familiar with deflate, don't be discouraged if none of that made any sense. Bill Bird has an excellent (long) lecture that I recommend to everyone: https://www.youtube.com/watch?v=SJPvNi4HrWQ
Damn I tried the bee movie script, they got me.
got how?
If you paste content that contains a very particular string ("intends to sue the human race for stealing our honey"), the contents are replaced with the phrase "the bee movie script? really? how original"
Also got got. I assume the Bee Movie script is the first choice for a lot of people needing an ad-hoc big block of text. It also compresses pretty well.
https://github.com/lynn/flateview/blob/2668beaa5cc8cae387b6f...
How is that a thing? Guess I shall go down the rabbit hole.
Also the first time I've heard about this. There goes the next hour.
Please report back, as many of us don't have an hour. We rely on soldiers like you! Thanks!
It's mostly what heliodex said: it's a copypasta when people need big text. There's also a compression meme around the Bee Movie and other movies (on YouTube: "bee movie in 10s" or "it gets faster every time X"). But contrary to what I initially thought, it's not a zlib-specific joke.
Yeah, the movie became a bit of a meme at some point and somehow shoehorning in "the entire bee movie script" into random places became a part of that.
What happened to lorem ipsum?
not easily compressible, i guess?
It should really normalise whitespace before that check, because the version of the script I found split the line :)
Um, sorry, I don't really get it. Is "the bee movie script? really? how original" a comment?
It's the string the tool shows instead of the actual compressed-stream info. You can see the result directly by putting some text which contains "intends to sue the human race for stealing our honey" into the input text box.
Thanks. So only for this tool - not zlib normally?
Yes https://github.com/lynn/flateview/blob/2668beaa5cc8cae387b6f...
The byte counter seems broken somehow. "Compressing" a single character with a compression level of 0 says "12 bytes", yet in the visualization there's less than 8 bytes (~7.5).
When compressing with a level higher than 0, the bits also don't appear to add up to a natural number of bytes, so I'm thinking the visualization is missing some padding?
At least for me, compressing a single "a" at compression level 0 gives me an output of 91 bits, which rounds up to 12 bytes.
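For what it's worth, Python's zlib agrees with the 12 bytes (2-byte zlib header + a stored deflate block + 4-byte Adler-32), so the handful of bits the visualization doesn't show are most likely the padding that aligns a stored block's contents to a byte boundary. A quick check, assuming level 0 really does emit a stored block:

    import zlib

    out = zlib.compress(b"a", 0)   # level 0: a "stored" (uncompressed) block
    print(len(out), len(out) * 8)  # should print: 12 96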
I would really like to see one of these for brotli.
Also for zopfli vs level 9 compression with this tool as-is.
This is great! Just missing a way to understand how the parameters are encoded, or is there something somewhere?
s/Z-Lib/zlib/
I wonder if this can be blamed on the HN title auto-shortener or not...
I was expecting something about how many books they had, so this was a funny surprise. I do wonder if the naming was a deliberate attempt at hiding, much like naming a torrent tracker after the sound made by a pig.