The answer seems obvious to me:
1. PDFs support arbitrary attached/included metadata in whatever format you like.
2. So everything that produces PDFs should attach the same information in a machine-friendly format.
3. Then everyone who wants to "parse" the PDF can refer to the metadata instead.
From a practical standpoint: my first name is Geoff. Half the resume parsers out there interpret my name as "Geo" and "ff" separately, because that's how the text gets placed into the PDF. This happens across multiple source applications.
Probably because "ff" is rendered as a ligature.
Disclaimer - Founder of Tensorlake, we built a Document Parsing API for developers.
This is exactly the reason why Computer Vision approaches to parsing PDFs work so well in the real world. Relying on metadata in files just doesn't scale across different sources of PDFs.
We convert PDFs to images, run a layout-understanding model on them first, and then apply specialized models (text recognition, table recognition) to the detected regions, stitching the results back together. That gets acceptable results in domains where accuracy is table stakes.
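For illustration only (not Tensorlake's actual API), a minimal sketch of that kind of pipeline; `detect_layout`, `recognize_text`, and `recognize_table` are hypothetical placeholders for the specialized models, and PyMuPDF is assumed just for rasterizing pages:

```python
import fitz  # PyMuPDF, assumed here only for turning pages into images

def parse_document(path: str, dpi: int = 200):
    doc = fitz.open(path)
    pages = []
    for page in doc:
        image = page.get_pixmap(dpi=dpi).tobytes("png")  # rasterize the page
        blocks = []
        for region in detect_layout(image):              # hypothetical layout model
            if region["kind"] == "table":
                blocks.append(recognize_table(image, region["bbox"]))  # hypothetical table model
            else:
                blocks.append(recognize_text(image, region["bbox"]))   # hypothetical OCR model
        pages.append(blocks)  # stitch per-page results back together in reading order
    return pages
```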
It might sound absurd, but on paper this should be the best way to approach the problem.
My understanding is that PDFs are intended to produce output consumed by humans, not computers; the format seems focused on how to display data so that a human can (hopefully) read it easily. So using a technique that mimics the human approach would seem to make sense.
It is sad, though, that in 30+ years we didn't manage to add a consistent way to make a PDF readable by a machine. I wonder what incentives were missing that kept this from happening. Does anyone have some insight here?
Probably for the same reason images were not readable by machines.
Except PDFs dangle the hope of maybe being machine-readable because they can contain Unicode text, while images don't offer this hope.
> the format seems to be focused on how to display some data so that a human can (hopefully) easily read them
It may seem so, but what it really focuses on is how to arrange stuff on a page that has to be printed. Literally everything else, from forms to hyperlinks, was a later addition (and it shows, given the crater-sized security holes they punched into the format).
Kinda funny.
Printing a PDF and scanning it to email it would normally be worthy of major ridicule.
But you’re basically doing that to parse it.
I get it, have heard of others doing the same. Just seems damn frustrating that such is necessary. The world sure doesn’t parse HTML that way!
I've built document parsing pipelines for a few clients recently, and yeah this approach yields way superior results using what's currently available. Which is completely absurd, but here we are.
I've done only one pipeline that tried to parse the actual PDF structure, and the least surprising part of it is that some documents have a top-to-bottom layout while others are bottom-to-top, flipped, with the text flipped again to be readable. It only gets worse from there. Absurd is correct.
Jesus Christ. What other approaches did you try?
If the HTML in question included JavaScript that rendered everything, including text, into a canvas -- yes, this is how you would parse it. And PDF is basically that.
Maybe not literally that, but the eldritch horrors of parsing real-world HTML are not to be taken lightly!
While we have a PDF internals expert here, I'm itching to ask: Why is mupdf-gl so much faster than everything else? (on vanilla desktop linux)
Its search speed on big pdfs is dramatically faster than everything else I've tried and I've often wondered why the others can't be as fast as mupdf-gl.
Thanks for any insights!
While you're doing this, please also tell people to stop producing PDF files in the first place, so that eventually the number of new PDFs can drop to 0. There's no hope for the format ever since manager types decided that it is "a way to put paper in the computer" and not the publishing intermediate format it was actually supposed to be. A vague facsimile of digitization that should have never taken off the way it did.
This has close to zero relevance to the OP.
I think it's a useful insight for people working on RAG using LLMs.
Devs working on RAG have to decide between parsing PDFs or using computer vision or both.
The author of the blog works on PdfPig, a framework for parsing PDFs. For its document understanding APIs, it uses a hybrid approach that combines basic image understanding algorithms with PDF metadata: https://github.com/UglyToad/PdfPig/wiki/Document-Layout-Anal...
GP's comment says a pure computer vision approach may be more effective in many real-world scenarios. It's an interesting insight since many devs would assume that pure computer vision is probably the less capable but also more complex approach.
As for the other comments that suggest directly using a parsing library's rendering APIs instead of rasterizing the end result: the reason is that detecting high-level visual objects (like tables, headings, and illustrations) and getting their coordinates is far easier with vision models than trying to infer those structures by examining hundreds of PDF line, text, glyph, and other low-level objects. I suspect those commenters have never tried to extract high-level structures from PDF object models. Try it once using PdfBox, Fitz, etc. to understand the difficulty. PDF really is a terrible format!
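To get a feel for what the low-level view gives you, here is a rough sketch with PyMuPDF (fitz): you get blocks, lines, and spans with geometry and font info, but nothing that says "this is a table" or "this is a heading".

```python
import fitz  # PyMuPDF

doc = fitz.open("report.pdf")
page = doc[0]
for block in page.get_text("dict")["blocks"]:  # blocks -> lines -> spans
    if block["type"] != 0:                     # 0 = text block, 1 = image block
        continue
    for line in block["lines"]:
        for span in line["spans"]:
            # All you get is geometry, font name, size, and the raw text;
            # headings and table cells have to be inferred from these hints.
            print(span["bbox"], span["font"], span["size"], span["text"])
```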
It's a good ad tho
So you've outsourced the parsing to whatever software you're using to render the PDF as an image.
Seems like a fairly reasonable decision given all the high quality implementations out there.
How is it reasonable to render the PDF, rasterize it, OCR it, use AI, instead of just using the "quality implementation" to actually get structured data out? Sounds like "I don't know programming, so I will just use AI".
> How is it reasonable to render the PDF, rasterize it, OCR it, use AI, instead of just using the "quality implementation" to actually get structured data out?
Because PDFs might not have the data in a structured form; how would you get the structured data out of an image in the PDF?
As someone who had to parse form data from a pdf, where the pdf author named the inputs TextField1 TextField2 TextFueld3 etc.
Misspellings, default names, a mixture, home brew naming schemes, meticulous schemes, I’ve seen it all. It’s definitely easier to just rasterize it and OCR it.
Same. Then someone edits the form and changes the names of several inputs, obsoleting much of the previous work, some of which still needs to be maintained because multiple versions are floating around.
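A quick sketch (using PyMuPDF widgets) for dumping whatever field names a given version of the form actually uses, before deciding whether they can be trusted at all:

```python
import fitz  # PyMuPDF

doc = fitz.open("form.pdf")
for page in doc:
    for widget in page.widgets():  # AcroForm fields placed on this page
        print(widget.field_name, "=", repr(widget.field_value))
```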
I do PDF for a living, millions of PDFs per month, this is complete nonsense. There is no way you get better results from rastering and OCR than rendering into XML or other structured data.
How many different PDF generators have done those millions of PDFs tho?
Because you're right if you're paid to evaluate all the formats with the Mark 1 eyeball and do a custom parser for each. It sounds like it's feasible for your application.
If you want a generic solution that doesn't rely on a human spending a week figuring out that those 4 absolutely positioned text fields are the invoice number together (and in order 1 4 2 3), maybe you're wrong.
Source: I don't parse pdfs for a living, but sometimes I have to select text out of pdf schematics. A lot of times I just give up and type what my Mark 1 eyeball sees in a text editor.
We process invoices from around the world, so more PDF generators than I care to count. It is a hard problem for sure, but the problem is the rendering; you can't escape that by rastering it, that is rendering.
So it is absurd to pretend you can solve the rendering problem by rendering into an image instead of a structured format. By rendering into a raster, you now have three problems: parsing the PDF, rendering a quality raster, then OCR'ing the raster. It is mind-numbingly absurd.
You are using the Mark 1 eyeball for each new type of invoice to figure out what field goes where, right?
It is a bit more involved: we have a rule engine that has been fine-tuned over time and works on most invoices. There is also an experimental AI-based engine that we run in parallel, but the rule-based engine still wins on old invoices.
Rendering is a different problem from understanding what's rendered.
If your PDF renders a part of the sentence at the beginning of the document, a part in the middle, and a part at the end, split between multiple sections, it's still rather trivial to render.
To parse and understand that this is the same sentence? A completely different matter.
Computers "don't understand" things. They process things, and what you're saying is called layoutinng which is a key part of PDF rendering. I do understand for someone unfamiliar with the internals of file formats, parsing, text shapping, and rendering in general, it all might seem like a blackmagic.
There are many cases where images are exported as PDFs. Think invoices or financial statements that people send to financial services companies. Using layout understanding and OCR-based techniques leads to way better results than writing a parser which relies on the file's metadata.
The other thing is segmenting a document and linearizing it so that an LLM can understand the content better. Layout understanding helps with figuring out the natural reading order of various blocks of the page.
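As a rough illustration of what "figuring out reading order" means once a layout model has produced blocks with bounding boxes, here is a naive column-then-top-to-bottom sort; trained layout models do much better on hard pages:

```python
def reading_order(blocks, column_gap=50):
    """Order blocks of (x0, y0, x1, y1, text) column by column, top to bottom.
    A crude heuristic sketch, not a substitute for a trained layout model."""
    columns = []
    for block in sorted(blocks, key=lambda b: b[0]):      # scan left to right
        for col in columns:
            if abs(block[0] - col[0][0]) < column_gap:    # close left edges: same column
                col.append(block)
                break
        else:
            columns.append([block])                       # start a new column
    ordered = []
    for col in columns:                                   # columns are already left to right
        ordered.extend(sorted(col, key=lambda b: b[1]))   # top to bottom within a column
    return ordered
```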
I think it's reasonable because their models are probably trained on images, and not whatever "structured data" you may get out of a PDF.
Yes, this! We train it on a ton of diverse document images to learn the reading order and layouts of documents :)
But you have to render the PDF to get an image, right? How do you go from PDF to raster?
No model can do better on images than structured data. I am not sure if I am on crack or you're all talking nonsense.
> Sounds like "I don't know programming, so I will just use AI".
If you were leading Tensorlake, running on early stage VC with only 10 employees (https://pitchbook.com/profiles/company/594250-75), you'd focus all your resources on shipping products quickly, iterating over unseen customer needs that could make the business skyrocket, and making your customers so happy that they tell everyone and buy lots more licenses.
Because you're a stellar tech leader and strategist, you wouldn't waste a penny reinventing low-level plumbing that's available off-the-shelf, either cheaply or as free OSS. You'd be thinking about the inevitable opportunity costs: If I build X then I can't build Y, simply because a tiny startup doesn't have enough resources to build X and Y. You'd quickly conclude that building a homegrown, robust PDF parser would be an open-ended tar pit that precludes us from focusing on making our customers happy and growing the business.
And the rest of us would watch in awe, seeing truly great tech leadership at work, making it all look easy.
I would hire someone who understands PDFs instead of doing the equivalent of printing a digital document and scanning it for "digital record keeping". Stop everything and hire someone who understands the basics of data processing and some PDF.
PDFs don't always lay out characters in sequence, sometimes they have absolutely positioned individual characters instead.
PDFs don't always use UTF-8, sometimes they assign random-seeming numbers to individual glyphs (this is common if unused glyphs are stripped from an embedded font, for example)
etc etc
But all those problems exist when rendering into a surface or rastering too. I just don't understand how one thinks: this is a hard problem, so let me make it harder by converting it into another kind of problem that is just as hard (PDF to structured data vs PDF to raster), and then solve that new problem, which is also hard. It is absurd.
I don’t think people are suggesting: build a renderer > build an OCR pipeline > run it on PDFs.
I think people are suggesting: use a ready-made renderer > use ready-made OCR pipelines/APIs > run it on PDFs.
A colleague uses a document scanner to create a pdf of a document and sends it to you
You must return the data represented in it retaining as much structure as possible
How would you proceed? Return just the metadata of when the scan was made and how?
Genuinely wondering
You can use an existing ready-made renderer to render into structured data instead of a raster.
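For example, a sketch with PyMuPDF, which hands back positioned words rather than pixels:

```python
import fitz  # PyMuPDF

doc = fitz.open("input.pdf")
for page in doc:
    # Each entry is (x0, y0, x1, y1, word, block_no, line_no, word_no)
    for x0, y0, x1, y1, word, *_ in page.get_text("words"):
        print(f"{word!r} at ({x0:.1f}, {y0:.1f})")
```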
Sometimes scanned documents are structured really weird, especially for tables. Visually, we can recognize the intention when it's rendered, and so can the AI, but you practically have to render it to recover the spatial context.
> instead of just using the "quality implementation" to actually get structured data out?
I suggest spending a few minutes using a PDF editor program with some real-world PDFs, or even just copying and pasting text from a range of different PDFs. These files are made up of cute-tricks and hacks that whatever produced them used to make something that visually works. The high-quality implementations just put the pixels where they're told to. The underlying "structured data" is a lie.
EDIT: I see from further down the thread that your experience of PDFs comes from programmatically generated invoice templates, which may explain why you think this way.
> How is it reasonable to render the PDF, rasterize it, OCR it, use AI, instead of just using the "quality implementation" to actually get structured data out?
Because the underlying "structured data" is never checked while the visual output is checked by dozens of people.
"Truth" is the stuff that the meatbags call "truth" as seen by their squishy ocular balls--what the computer sees doesn't matter.
I was wondering: does your method ultimately produce better parsing than the program you used to initially parse and display the PDF? Or is the value in unifying the parsing across different input parsers?
> This is exactly the reason why Computer Vision approaches for parsing PDFs works so well in the real world.
One of the biggest benefits of PDFs though is that they can contain invisible data. E.g. the spec allows me to embed cryptographic proof that I've worked at the companies I claim to have worked at within my resume. But a vision-based approach obviously isn't going to be able to capture that.
What software can be used to write and read this invisible data? I want to document continuous edits to published documents, which cannot show these edits until they are reviewed, compiled, and revised. I was looking at doing this in Word, but we keep Word and PDF versions of these documents.
If that stuff is stored as structured metadata, extracting it should be trivial.
Cryptographic proof of job experience? Please explain more. Sounds interesting.
If someone told me there was cryptographic proof of job experience in their PDF, I would probably just believe them because it’d be a weird thing to lie about.
Encrypted (and hidden) embedded information, e. g. documents, signatures, certificates, watermarks, and the like. To (legally-binding) standards, e. g. for notary, et cetera.
Yeah we don't handle this yet.
> "This is exactly the reason why Computer Vision approaches for parsing PDFs works so well in the real world."
Well, to be fair, in many cases there's no way around it anyway since the documents in question are only scanned images. And the hardest problems I've seen there are narrative typography artbooks, department store catalogs with complex text and photo blending, as well as old city maps.
So you parse PDFs, but also OCR images, to somehow get better results?
Do you know you could just use the parsing engine that renders the PDF to get the output? I mean, why raster it, OCR it, and then use AI? Sounds like creating a problem in order to use AI to solve it.
We parse PDFs to convert them to text in a linearized fashion. The use case for this would be to use the content for downstream use cases - search engine, structured extraction, etc.
None of that changes the fact that to get a raster, you have to solve the PDF parsing/rendering problem anyway, so you might as well get structured data out instead of pixels, rather than turning it into yet another problem (OCR).
Yes, but a lot of the improvement is coming from layout models and/or multimodal LLMs operating directly on the raster images, as opposed to via classical OCR. This gets better results because the PDF format does not necessarily impart reading order or semantic meaning; the only way to be confident you're reading it like a human would is to actually do so - to render it out.
Another thing is that most document parsing tasks are going to run into a significant volume of PDFs which are actually just a bunch of scans/images of paper, so you need to build this capability anyways.
TL;DR: PDFs are basically steganography
Hard no.
LLMs aren't going to magically do more than what your PDF rendering engine does, rastering it and OCR'ing doesn't change anything. I am amazed at how many people actually think it is a sane idea.
I think there is some kind of misunderstanding. Sure, if you somehow get structured, machine-generated PDFs, parsing them might be feasible.
But what about the "scanned" document part? How do you handle that? Your PDF rendering engine probably just says: image at pos x,y with size height,width.
So as parent says you have to OCR/AI that photo anyway and it seems that's also a feasible approach for "real" pdfs.
Okay, this sounds like "because some part of the road is rough, why don't we just drive in the ditch alongside the road the whole way; we could drive a tank, that would solve it"?
Doesn't rendering to an image require proper parsing of the PDF?
PDF is more like a glorified SVG format than a Word format.
It only contains info on how the document should look, but no semantic information like sentences, paragraphs, etc. Just a bag of characters positioned in certain places.
Yes, and don't for a second think this approach of rastering and OCR'ing is sane, let alone a reasonable choice. It is outright absurd.
This is the parallel of some of the dotcom peak absurdities. We are in the AI peak now.
Thanks for the pointer!
How ridiculous.
`mutool convert -o <some-txt-file-name.txt> -F text <somefile.pdf>`
Disclaimer: I work at a company that generates and works with PDFs.
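Roughly the same extraction from Python, for anyone scripting it; PyMuPDF wraps the same MuPDF engine that mutool uses:

```python
import fitz  # PyMuPDF

doc = fitz.open("somefile.pdf")
with open("some-txt-file-name.txt", "w", encoding="utf-8") as out:
    for page in doc:
        out.write(page.get_text("text"))
```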
Great rundown. One thing you didn't mention that I thought was interesting to note is incremental-save chains: the first startxref offset is fine, but the /Prev links that Acrobat appends on successive edits may point a few bytes short of the next xref. Most viewers (PDF.js, MuPDF, even Adobe Reader in "repair" mode) fall back to a brute-force scan for obj tokens and reconstruct a fresh table so they work fine while a spec-accurate parser explodes. Building a similar salvage path is pretty much necessary if you want to work with real-world documents that have been edited multiple times by different applications.
You're right, this was a fairly common failure state seen in the sample set. The previous reference, or one in the reference chain, would point to an offset of 0 or outside the bounds of the file, or just be plain wrong.
What prompted this post was trying to rewrite the initial parse logic for my project PdfPig[0]. I had originally ported the Java PDFBox code but felt like it should be 'simple' to rewrite more performantly. The new logic falls back to a brute-force scan of the entire file if a single xref table or stream is missed and just relies on those offsets in the recovery path.
However it is considerably slower than the code before it and it's hard to have confidence in the changes. I'm currently running through a 10,000 file test-set trying to identify edge-cases.
[0]: https://github.com/UglyToad/PdfPig/pull/1102
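For anyone curious what that brute-force fallback looks like in miniature: scan the raw bytes for `N G obj` headers and rebuild the offset table, letting later duplicates win the way incremental updates do. A rough sketch; a real implementation also has to avoid false hits inside stream data.

```python
import re

OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")

def rebuild_xref(raw: bytes):
    """Map (object number, generation) -> byte offset, ignoring the stated xref table."""
    offsets = {}
    for m in OBJ_RE.finditer(raw):
        key = (int(m.group(1)), int(m.group(2)))
        offsets[key] = m.start()  # later occurrences overwrite earlier ones
    return offsets

with open("damaged.pdf", "rb") as f:
    table = rebuild_xref(f.read())
```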
That robustness-vs-throughput trade-off is such a staple of PDF parsing. My guess is that the new path is slower because the recovery scan now always walks the whole byte range and has to inflate any object streams it meets before it can trust the offsets even when the first startxref would have been fine.
The 10k-file test set sounds great for confidence-building. Are the failures clustering around certain producer apps like Word, InDesign, scanners, etc.? Or is it just long-tail randomness?
Reading the PR, I like the recovery-first mindset. If the common real-world case is that offsets lie, treating salvage as the default is arguably the most spec-conformant thing you can do. Slow-and-correct beats fast-and-brittle for PDFs any day.
> So you want to parse a PDF?
Absolutely not. For the reasons in the article.
Would be nice if my banks provided records in a more digestible format, but until then, I have no choice.
I find it pretty sad that for some banks the CSV export is behind a paywall.
No shit. I've made that mistake before, not gonna try it again.
As someone who has written a PDF parser - it's definitely one of the weirdest formats I've seen, and IMHO much of it is caused by attempting to be a mix of both binary and text; and I suspect at least some of these weird cases of bad "incorrect but close" xref offsets may be caused by buggy code that's dealing with LF/CR conversions.
What the article doesn't mention is a lot of newer PDFs (v1.5+) don't even have a regular textual xref table, but the xref table is itself inside an "xref stream", and I believe v1.6+ can have the option of putting objects inside "object streams" too.
Yeah, I was a little surprised that this didn't go beyond the simplest xref table and get into streams and compression. Things don't seem that bad until you realize the object you want is inside a stream that uses a weird riff on PNG compression, its offset is in an xref stream that's flate-compressed, and that stream is a later addition to the document, so you need to start with a plain table at the end of the file and then work out which versions of which objects live where. Then there's the fact that you can find documentation for 1.7 pretty easily, but up until two years ago the 2.0 spec was paywalled.
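For the curious, a sketch of what decoding one of those xref streams involves, assuming FlateDecode plus the common PNG "Up" row predictor (/Predictor 12) and field widths /W [1 2 1]; real files can use other predictors and widths:

```python
import zlib

def decode_xref_stream(stream: bytes, columns: int = 4, widths=(1, 2, 1)):
    """columns is the /Columns value, normally sum of the /W entries."""
    raw = zlib.decompress(stream)                 # FlateDecode
    row_len = columns + 1                         # one PNG filter byte per row
    prev = bytearray(columns)
    entries = []
    for i in range(0, len(raw), row_len):
        row = raw[i:i + row_len]
        assert row[0] == 2, "this sketch only handles the PNG 'Up' filter"
        cur = bytearray((row[j + 1] + prev[j]) & 0xFF for j in range(columns))
        prev = cur
        # Split the decoded row into fields per /W, e.g. [1 2 1]:
        fields, pos = [], 0
        for w in widths:
            fields.append(int.from_bytes(cur[pos:pos + w], "big"))
            pos += w
        entries.append(tuple(fields))             # (type, offset or objstm number, gen or index)
    return entries
```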
Yeah, I was really surprised to learn that Paeth prediction improves the compression ratio of xref tables so much!
Yeah, PDF didn't anticipate streaming. That pesky trailer dictionary at the end means you have to wait for the file to fully load to parse it.
Having said that, I believe there are "streamable" PDF's where there is enough info up front to render the first page (but only the first page).
(But I have been out of the PDF loop for over a decade now so keep that in mind.)
Yes, you're right there are Linearized PDFs which are organized to enable parsing and display of the first page(s) without having to download the full file. I skipped those from the summary for now because they have a whole chunk of an appendix to themselves.
One of the very first programming projects I tried, after learning Python, was a PDF parser to try to automate grabbing maps for one of my DnD campaigns. It did not go well lol.
Thanks kindly for this well done and brave introduction. There are few people these days who'd even recognize the bare ASCII 'PostScript' form of a PDF at first sight. The first step is to unroll into ASCII, of course, and remove the first wrapper of Flate/ZIP, LZW, or RLE. I recently teased Gemini for accepting .PDF and not .EPUB (chapterized HTML in a zip, basically, with almost-guaranteed paragraph streams of UTF-8) and it lamented apologetically that its PDF support was opaque and library-oriented. That was very human of it. Aside from a quick recap of the most likely LZW wrapper format, a deep dive into Linearization and reordering the objects by 'first use on page X' and writing them out again preceding each page would be a good pain project.
UglyToad is a good name for someone who likes pain. ;-)
See https://digitalcorpora.org/corpora/file-corpora/cc-main-2021... for a set of 8 million PDF files from the web, as seen by a single crawl of Common Crawl.
I convert the PDF into an image per page, then dump those images into either an OCR program (if the PDF is a single column) or a vision-LLM (for double columns or more complex layouts).
Some vision LLMs can accept PDF inputs directly too, but you need to check that they're going to convert to images and process those rather than attempting and failing to extract the text some other way. I think OpenAI, Anthropic and Gemini all do the images-version of this now, thankfully.
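The rasterize step itself is short; a sketch with PyMuPDF, writing one PNG per page to feed an OCR engine or a vision model:

```python
import fitz  # PyMuPDF

doc = fitz.open("scanned.pdf")
for i, page in enumerate(doc, start=1):
    pix = page.get_pixmap(dpi=300)   # 300 dpi is a common choice for OCR
    pix.save(f"page-{i:03d}.png")
```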
Sadly this makes some sense; PDF represents characters in the text as offsets into its fonts, and often the fonts are incomplete, so an 'A' in the PDF is often not good old ASCII 65. In theory there are two optional systems that should tell you it's an 'A' - except when they don't; so the only way to know is to use the font to draw it.
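One cheap heuristic for spotting those broken mappings before trusting the text layer: extract the text and count replacement characters, which PyMuPDF typically emits when a glyph has no usable Unicode mapping. Only a sketch; it won't catch fonts that map glyphs to wrong-but-valid characters.

```python
import fitz  # PyMuPDF

def text_layer_untrustworthy(path: str, threshold: float = 0.01) -> bool:
    doc = fitz.open(path)
    text = "".join(page.get_text("text") for page in doc)
    if not text.strip():
        return True                    # no text layer at all, e.g. a pure scan
    bad = text.count("\ufffd")         # U+FFFD REPLACEMENT CHARACTER
    return bad / len(text) > threshold

# If this returns True, fall back to rasterize-and-OCR.
```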
If you don't have a known set of PDF producers this is really the only way to safely consume PDF content. Type 3 fonts alone make pulling text content out unreliable or impossible, before even getting to PDFs containing images of scans.
I expect the current LLMs significantly improve upon the previous ways of doing this, e.g. Tesseract, when given an image input? Is there any test you're aware of for model capabilities when it comes to ingesting PDFs?
I've been trying it informally and it's getting really good - Claude 4 and Gemini 2.5 seem to do a perfect job now, though I'm still paranoid that some rogue instruction in the scanned text (accidental or deliberate) might produce an inaccurate result.
pdfgrep (as a command line utility) is pretty great if one simply needs to search text in PDF files https://pdfgrep.org/
I remember a prior boss of mine being asked if the application the company I was working for made could use PDF as an input. His response was to laugh, then say "No, there is no coming back from chaos." The article has only reinforced that he was right.
I parsed the original Illustrator format in 1988 or 1989, which is a precursor to PDF. It was simpler than today's PDF, but of course I had zero documentation to guide me. I was mostly interested in writing Illustrator files, not importing them, so it was easier than this.
Can you just ignore the index and read the entire file to find all the objects?
Last weekend I was trying to convert a PDF of the Upanishads, which contains a mix of Sanskrit and English words.
By god it's so annoying; I don't think I would have managed without the help of Claude Code, just iterating over different libraries and methods over and over again.
Can we just write things in markdown from now on? I really, really, really don't care that the images you put in are nicely aligned to the right side and everything is boxed together nicely.
Just give me the text and let me render it however I want on my end.
The whole point of PDF is that it's digital paper. It's up to the author how he wants to design it, just like a written note or something printed out and handed to you in person.
This is one of those things that seems like it shouldn't be that hard until you start to dig in.
I did some exploration using LLMs to parse, understand then fill in PDFs. It was brutal but doable. I don't think I could build a "generalized" solution like this without LLMs. The internals are spaghetti!
Also, god bless the open source developers. Without them also impossible to do this in a timely fashion. pymupdf is incredible.
https://www.linkedin.com/posts/sergiotapia_completed-a-reall...
As a matter of urgency PDF needs to go the way of Flash, same goes for TTF. Those that know, know why.
I think a PDF 2.0 would just be an extension of a single file HTML page with a fixed viewport
I presume you meant that as "PDF next generation" because PDF 2.0 already exists https://en.wikipedia.org/wiki/History_of_PDF#ISO_32000-2:_20...
Also, absolutely not to your "single file HTML" theory: it would still allow javascript, random image formats (via data: URIs), conversely I don't _think_ that one can embed fonts in a single file HTML (e.g. not using the same data: URI trick), and to the best of my knowledge there's no cryptographic signing for HTML at all
It would also suffer from the linearization problem mentioned elsewhere in that one could not display the document if it were streaming in (the browsers work around this problem by just janking items around as the various .css and .js files resolve and parse)
I'd offer Open XPS as an alternative even given its Empire of Evil origins because I'll take XML over a pseudo-text-pseudo-binary file format all day every day https://en.wikipedia.org/wiki/Open_XML_Paper_Specification#C...
I've also heard people cite DjVu https://en.wikipedia.org/wiki/DjVu as an alternative but I've never had good experience with it, its format doesn't appear to be an ECMA standard, and (lol) its linked reference file is a .pdf
As it happens, we already have "HTML as a document format". It's the EPUB format for ebooks, and it's just a zip file filled with an HTML document, images, and XML metadata. The only limitation is that all viewers I know of are geared toward rewrapping the content according to the viewport (which makes sense for ebooks), though the newer specifications include an option for fixed-layout content.
you can "just" enforce pdf/a
...well, there are like 50 different PDF/A versions; just pick one of them :)
That, and only commercial PDF libraries support PDF/A. Apparently it is much harder than regular PDF, so open source libs don't bother.
Founder of Mixpeek here; we fine-tune late-interaction models on PDFs based on domain: https://mixpeek.com/extractors
Do you offer local or on-premise models? There are certain PDFs we cannot send to an API.