My ZIP isn't your ZIP: Identifying and exploiting semantic gaps between parsers

(usenix.org)

47 points | by layer8 4 days ago

19 comments

est 4 minutes ago

IIRC similar attacks exist on DEFLATE

There used to be a .png picture that displayed totally different content in Safari, Firefox, and IE.

saurik 4 hours ago

I'm cited on the first page of this paper (reference 20) for my work on the Android Master Key vulnerability (which I didn't find, to be clear, but I did most of the exploitation people saw), and, while this paper looks AWESOME (and I'm very excited to read it in detail), if you are interested in this concept but feel you need something a bit more concrete--maybe with diagrams and some hand-holding--to understand what is going on, I will recommend my series of articles on Master Key as an introduction.

https://www.saurik.com/masterkey1.html

https://www.saurik.com/masterkey2.html

https://www.saurik.com/masterkey3.html

schoen an hour ago

This is great. It feels like a central example of the phenomenon of parser differentials (and nice use of tools to find them more efficiently).

Also, as the lead author's name is spelled the same as an English pronoun, we can anticipate natural language parsing ambiguities from writing about this research in English prose! For example, "You discovered that there are many opportunities for parser differentials due to the underspecified nature of the ZIP format" or "You described a practical method of bypassing plagiarism detectors and several other kinds of file content scanners".

Actually, I'm tempted to propose that for the April Fool's Did You Know? on Wikipedia next year. "Did you know ... that You won a Usenix Security award for finding ways to construct ambiguous texts?"

pabs3 an hour ago

A linter for zip files that can probably detect some of these:

https://github.com/ronomon/pure

captn3m0 4 hours ago

Also related to ZIP parsing differentials, recently reported and fixed at PyPI: https://blog.pypi.org/posts/2025-08-07-wheel-archive-confusi...

tptacek an hour ago

It's good to see stuff like this getting found and fixed, but let me ask: given how the Python packaging ecosystem works, what is the practical scenario in which this would be exploitable?

tptacek 4 hours ago

This is a really good paper that reaches a bunch of fun conclusions, but to my eyes the practical findings are kind of marginal --- you can defeat an AV scanner, but you could already defeat AV scanners; you can defeat plagiarism-detectors, but you could already defeat plagiarism-detectors; you can package a malicious Java class in a benign-looking JAR, but that attack presumes you're convincing a target to load a JAR file you control.

The one legit-practical attack I see is the one where they trick the VS Code Extension marketplace into serving extensions with trusted publishers, but even there I'm struck by the fact that the security model for verifying extensions would depend on ZIP metadata.

I do not at all mean to talk this work down; this is my favorite species of vulnerability research, and I can see why it did well at Usenix Security.

FreakLegion 2 hours ago

It's a decent systematic look at something people have been doing ad hoc for a long time. In 2010 or so I realized:

1. Authenticode signatures have unauthenticated sections.

2. ZIP files don't require headers.

So you can shove a ZIP file (i.e. JAR, DOCM, APK, etc.) into a signed Windows executable without breaking its signature, and then depending on the extension it will do any number of things when clicked.
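
The ZIP half of that trick can be sketched in a few lines. This is only an illustration under stated assumptions: `fake_exe` is a stand-in byte string, not a real PE file and not an Authenticode signature, and `build_zip` is a hypothetical helper. It shows that Python's `zipfile` locates the archive from the end of the file, so leading non-ZIP bytes do not stop it from being read.

```python
import io
import zipfile


def build_zip(files):
    """Hypothetical helper: build an in-memory ZIP from a {name: bytes} mapping."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()


# Stand-in for an executable: the "MZ" magic followed by filler bytes.
# Assumption for illustration only; nothing here is a valid or signed PE.
fake_exe = b"MZ" + b"\x00" * 1024

archive = build_zip({"payload.txt": b"hello from the appended archive"})

# Concatenate: the result starts like an executable and ends like a ZIP.
polyglot = fake_exe + archive

# zipfile finds the end-of-central-directory record near the end of the
# data and compensates for the bytes that precede the archive.
with zipfile.ZipFile(io.BytesIO(polyglot)) as zf:
    names = zf.namelist()
    payload = zf.read("payload.txt")

print(polyglot[:2], names, payload)
```

Whether a given tool treats such a file as an executable or as an archive then depends on how it identifies the file type, which is the gap the comment describes.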

(The extent to which this works has changed a lot in the intervening years, but prior to a patch in 2013 it was especially bad, and the patches never made their way into the spec, so custom Authenticode validators like Wine's or, say, the one in Palo Alto Networks gear, were still vulnerable the last time I checked.)

Anyway, at the same time:

1. Cybersecurity products lean on Authenticode to keep false positives down for specific publishers.

2. Those same products cache everything by hash without regard for file type.

Put all of this together and you could, as of 2020 at least, not only execute whatever you wanted, you could also have it misreported by CrowdStrike or whoever as a signed Windows component.

Fun stuff, but I agree that it's kind of marginal.

pixl97 4 hours ago

Zip is a fun minefield across different OSes, libraries, and ages of system. Zip64 is one I've seen companies forget to test: once an archive holds more than 65535 files and has to interoperate with more modern systems, they end up with data loss. There are so many things you need to test that, if possible, your best choice is some other compression format without the pitfalls.
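
A rough sketch of where the 65535 limit comes from, using only Python's standard library. The classic end-of-central-directory record stores the entry count in a 16-bit field, so a writer that supports Zip64 puts the real count in a separate Zip64 record. The variable names are illustrative, and reading the last 22 bytes assumes the archive has no trailing comment.

```python
import io
import struct
import zipfile

N = 65536  # one more entry than a 16-bit count field can represent

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
    for i in range(N):
        zf.writestr(f"f{i}", b"")  # empty members keep the archive small
raw = buf.getvalue()

# Classic end-of-central-directory record: 22 bytes when there is no comment.
# Fields: signature, disk number, disk holding the central directory,
# entries on this disk, total entries, directory size, directory offset,
# comment length.
(sig, _disk, _cd_disk, _entries_disk, total16,
 _cd_size, _cd_offset, _comment_len) = struct.unpack("<4s4H2LH", raw[-22:])

# A Zip64-aware reader takes the count from the Zip64 record instead.
with zipfile.ZipFile(io.BytesIO(raw)) as zf:
    real_count = len(zf.namelist())

print(sig, total16, real_count)
```

A reader that trusts only the 16-bit field sees a different number of entries than a Zip64-aware one, which is one way files can silently go missing. Building the archive takes a few seconds.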

o11c 6 hours ago

Key line from the abstract, since zip parser differences in general are old news:

> We summarize our findings as 14 distinct parsing ambiguity types in three categories with detailed analysis, systematizing current knowledge and uncovering 10 types of new parsing ambiguities.

actionfromafar 6 hours ago

Tampering with signed binaries sounds pretty serious

tptacek 4 hours ago

It depends on how they're signed. A signature format that works on individual objects inside of an archive, rather than on a whole signed archive, seems crazy. In this case, it's a JAR file loader; doesn't seem like that big a deal?

hinkley 6 hours ago

Maybe an argument to use zlib consistently.

woodruffw 5 hours ago

Unless, of course, the differential occurs between versions of zlib. I think the bigger problem here is that ZIP is just not a very well defined format.

aaviator42 6 hours ago

An argument for a better defined file format specification perhaps, but I don't think it's necessarily a good thing for everyone to use or have to use the same implementation.

Muromec 5 hours ago

If everyone has the same parser, whole classes of bugs just stop being exploitable. The classic one is a parser at the edge validating something while a parser further down the line sees a different result, one it expects to have been rejected during validation.

Both parsers could be buggy, but when they have different kinds of bugs, you get a zero-click undetectable exploit.
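
A minimal sketch of such a differential, assuming two hypothetical consumers: a validator that takes the first entry with a given name, and a loader that looks the name up. The file name and contents are made up. In CPython's `zipfile`, looking up a duplicated name returns the last matching entry, while `infolist()` still exposes both.

```python
import io
import warnings
import zipfile

buf = io.BytesIO()
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # zipfile warns about the duplicate name
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("app/config.txt", b"benign")
        zf.writestr("app/config.txt", b"malicious")

with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
    entries = zf.infolist()
    # Hypothetical "edge validator": inspects the first entry with that name.
    first_wins = zf.read(entries[0])
    # Hypothetical "loader": looks the name up, which yields the last entry.
    last_wins = zf.read("app/config.txt")

print(len(entries), first_wins, last_wins)
```

Neither behaviour is a bug on its own; the problem only appears when the two views are combined in one pipeline.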

woodruffw 5 hours ago

I don’t think it’s this simple: you can still produce observable differentials with a single parser by using different options within that parser in different places. The ZIP format itself affords ample opportunities for that.

socalgal2 3 hours ago

As someone who works on specs that are shared across different organizations' implementations, you can write all the specs you want but no conformance tests = no conformance.

blibble 5 hours ago

zlib (deflate) is just the compression type usually (not always) used in zips

zip is the container around it
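
To make the container/codec split concrete, here is a small sketch that pulls one member's compressed bytes out of a ZIP by hand and inflates them with `zlib` directly. The offsets follow the fixed 30-byte local file header layout; the member name and data are illustrative.

```python
import io
import struct
import zipfile
import zlib

data = b"zip is the container, deflate is the codec. " * 20

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("note.txt", data)
raw = buf.getvalue()

# Use the central directory only to find where the member lives.
with zipfile.ZipFile(io.BytesIO(raw)) as zf:
    info = zf.getinfo("note.txt")

# The local file header has 30 fixed bytes; the name length and extra-field
# length are 16-bit values at offsets 26 and 28 within it.
name_len, extra_len = struct.unpack_from("<HH", raw, info.header_offset + 26)
start = info.header_offset + 30 + name_len + extra_len
compressed = raw[start:start + info.compress_size]

# Negative wbits tells zlib to expect a raw deflate stream with no zlib header.
recovered = zlib.decompress(compressed, -15)

print(info.compress_type == zipfile.ZIP_DEFLATED, recovered == data)
```

The parsing ambiguities discussed in this thread live in the container fields (headers, offsets, counts), which is why agreeing on a deflate implementation alone does not remove them.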