The paper is more interesting than just another VLM for OCR; they get into compression. E.g. there is this quote:
>Our work represents an initial exploration into the boundaries of vision-text compression, investigating how many vision tokens are required to decode text tokens. The preliminary results are encouraging: DeepSeek-OCR achieves near-lossless OCR compression at approximately 10× ratios, while 20× compression still retains 60% accuracy.
(I guess you could say a picture token is worth 10 textual tokens...)
Could someone explain to a noob what the information-theoretic intuition is here? Why does this work? Is it that text tokens are still too "granular"/repetitive and don't come close to ideal entropy coding? Or does switching to vision tokens escape the limitation of working "one word-ish at a time", letting you get closer to the entropy (similar to the way arithmetic coding does compared to Huffman codes)?
And then they start talking about handling long-context by literally(?) downscaling images, forming a correspondence between information loss in the textual domain and the image domain.
Each text token is usually a subword unit, but in VLMs the visual tokens live in a semantic space. Semantic space obviously compresses much better than subword slices.
Disclaimer: not an expert, off the top of my head.
LLMs are compute-heavy, with attention cost scaling quadratically in the number of tokens. They are trying to compress text tokens into vision tokens with their VLM.
Maybe they would render text to an image before tokenizing to reduce the compute cost.
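Rough arithmetic for why that would pay off (the token counts here are made up for illustration, not taken from the paper):

```python
# Self-attention cost grows roughly quadratically with sequence length,
# so cutting the token count by ~10x cuts that term by ~100x.
text_tokens = 10_000                   # a long document kept as text tokens (assumed)
vision_tokens = text_tokens // 10      # same content at the paper's ~10x compression

attention_ratio = text_tokens**2 / vision_tokens**2
print(f"attention cost ratio: ~{attention_ratio:.0f}x")   # ~100x
```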
But naively, wouldn't you expect the representation of a piece of text in terms of vision tokens to be roughly the same number of bits (or more) as the representation in text tokens? You're changing representation, sure, but that by itself doesn't give you any compute advantage unless there is some sparsity/compressibility you can take advantage of in the domain you transform to, right?
So I guess my question is where the juice is being squeezed from: why does the vision-token representation end up more efficient than text tokens?
Vision is how humans see text, so text must have built-in adaptations to protect it from visual noise. For example, two words that look similar must never appear in similar contexts, or else they would be conflated. Hence we can safely reduce such words to the same token. Or something like that.
The trick is that vision tokens are continuous-valued vectors, while text tokens are elements of a small discrete set (which are converted into continuous-valued vectors by a lookup table). So vision tokens can convey significantly more bits per token than text tokens, which lets them pack the content of multiple text tokens into a single vision token.
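Rough numbers to make that concrete (vocabulary size, embedding width, and precision are assumptions for illustration, not figures from the paper):

```python
import math

vocab_size = 128_000        # assumed LLM vocabulary size
d_model = 1280              # assumed token embedding width
bits_per_float = 16         # bf16 activations

bits_per_text_token = math.log2(vocab_size)         # ~17 bits of choice per token
bits_per_vision_token = d_model * bits_per_float    # ~20,000 raw bits per token

print(bits_per_text_token, bits_per_vision_token)
```

The raw capacity gap is huge; the practical question is how much of it the encoder/decoder pair can actually exploit, which is what the 10x / 20x compression results are probing.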
I wonder if text written using Chinese characters is more compatible with such vision-centric compression than Latin text.
Vision tokens are a good compression medium because with one vision token you have one vector of N elements, but with textual tokens you have M vectors of N elements, because one vision token represents multiple pixels (and possibly multiple words). This is why it's a good compression medium for compute.
It will never be as precise as text tokens, but it can be really good, as they show in the paper.
>with one vision token you have one vector of N elements, but with textual tokens you have M vectors of N elements
Each vision token represents a 16x16 patch, but to fully cover a word you might need multiple vision tokens. So assuming that the embedding size of the vision token and text token is the same `d` (which I think has to be the case for multimodal models), wouldn't the fair comparison be `x * d` elements for a sentence in terms of vision tokens, and `y * d` for the same sentence in terms of text tokens? I don't see how you could know a priori that x << y (especially by a factor of 10 as quoted in the paper).
That said, if I experimentally try this by shrinking this very comment down to the smallest font size I can read and counting how many 16x16 patches it takes, you can fit more text than I expected into each "vision token". So I can maybe buy that x is at least not greater than y. But it can't be as simple as "each vision token can cover more text", since that only enables better compression if the encoder can actually uncover some sort of redundancy within each token. (And presumably the type of redundancy it uncovers isn't something that "classical" compression techniques can exploit, otherwise it seems like it would have been tried by now?)
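For what it's worth, here is the back-of-envelope I end up with (every number below is an assumption for illustration, not taken from the paper):

```python
# Toy comparison of x (vision tokens) vs y (text tokens) for one page.
words_per_page = 800
tokens_per_word = 1.3
y = int(words_per_page * tokens_per_word)      # ~1040 text tokens

image_side = 1024                              # assumed render resolution
patch = 16
raw_patches = (image_side // patch) ** 2       # 4096 raw 16x16 patches
token_compression = 16                         # downsampling before the decoder (see reply below)
x = raw_patches // token_compression           # 256 vision tokens

print(f"x = {x}, y = {y}, ratio ~ {y / x:.1f}x")
```

With these made-up numbers the ratio is only ~4x; the ~10x quoted in the paper presumably comes from their particular resolution and compression settings.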
You should read page 6 of the paper (and page 5 for the architecture breakdown); they show that they compress the vision tokens with a convolution, keeping strong semantic understanding while keeping the number of tokens small.
But I think it's still experimental.
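For intuition, conv-based token compression generally looks something like this (a minimal sketch of the idea only, not DeepSeek-OCR's actual module; all sizes are assumed):

```python
import torch
import torch.nn as nn

# Strided convs over the patch grid trade spatial resolution for fewer tokens:
# 64x64 = 4096 patch embeddings -> 16x16 = 256 compressed tokens (16x fewer).
downsample = nn.Sequential(
    nn.Conv2d(1024, 1024, kernel_size=3, stride=2, padding=1),  # 4x fewer tokens
    nn.GELU(),
    nn.Conv2d(1024, 1280, kernel_size=3, stride=2, padding=1),  # 16x fewer overall
)

patch_grid = torch.randn(1, 1024, 64, 64)     # (batch, channels, H, W) patch embeddings
compressed = downsample(patch_grid)           # -> (1, 1280, 16, 16)
tokens = compressed.flatten(2).transpose(1, 2)
print(tokens.shape)                           # (1, 256, 1280): 256 vision tokens
```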
Just a hunch, but maybe something to do with Unicode?
The paper makes no mention of Anna’s Archive. I wouldn’t be surprised if DeepSeek took advantage of Anna’s offer granting OCR researchers access to their 7.5-million-book (350 TB) Chinese non-fiction collection ... which is bigger than Library Genesis.
https://annas-archive.org/blog/duxiu-exclusive.html
Hahaha, I also immediately thought of this. Wonder when the OCR'd dataset will get released.
How does an LLM approach to OCR compare to say Azure AI Document Intelligence (https://learn.microsoft.com/en-us/azure/ai-services/document...) or Google's Vision API (https://cloud.google.com/vision?hl=en)?
Classical OCR still probably makes undesirable su6stıtutìons in CJK, because there are far too many similar characters, including some absurd ones that are only distinguishable under a microscope or by looking at their binary representations. LLMs are better constrained to valid sequences of characters, so they should be more accurate.
Or at least that kind of thing would motivate them to re-implement OCR with an LLM.
OmniAI has a benchmark that compares LLMs to cloud OCR services.
https://getomni.ai/blog/ocr-benchmark (Feb 2025)
Please note that LLMs have progressed at a rapid pace since February. We see much better results with the Qwen3-VL family, particularly Qwen3-VL-235B-A22B-Instruct for our use case.
Not sure why you're being downvoted, I'm also curious.
My impression is that OCR is basically solved at this point.
The OmniAI benchmark that's also referenced here wasn't updated with new models since February 2025. I assume that's because general purpose LLMs have gotten better at OCR than their own OCR product.
I've been able to solve a broad range of OCR tasks by simply sending each page as an image to Gemini 2.5 Flash Lite and asking it nicely to extract the content in Markdown under some additional formatting instructions. That will cost you around $0.20 for 1000 pages in batch mode and the results have been great.
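For reference, the setup is roughly this (a minimal sketch assuming the google-genai Python SDK; the prompt and file name are placeholders, and batch mode is omitted):

```python
from google import genai
from google.genai import types

client = genai.Client()  # picks up the API key from the environment

with open("page_001.png", "rb") as f:  # one rendered page per request
    page = f.read()

resp = client.models.generate_content(
    model="gemini-2.5-flash-lite",
    contents=[
        types.Part.from_bytes(data=page, mime_type="image/png"),
        "Extract all text on this page as Markdown. "
        "Preserve headings, lists and tables; do not add commentary.",
    ],
)
print(resp.text)
```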
I'd be interested to hear where OCR still struggles today.
Lots of OCR tools / LLMs (even Gemini 2.5 Pro) still struggle converting complex tables to Markdown or HTML: tables with multiple header rows and merged cells get mixed up, multiple columns with tick boxes get mixed up, and multi-page tables are not understood correctly. LlamaIndex also fails miserably on those things.
Curious to hear which OCR/LLM excels at these specific issues. Example complex table: https://cdn.aviation.bot/complex-tables.zip
I can only parse this table correctly by first parsing the table headers manually into HTML as example output. However, it still mixes up tick boxes. Full table examples: https://www.easa.europa.eu/en/icao-compliance-checklist
> Lots of OCR tools / LLMs (even Gemini 2.5 Pro) still struggle converting complex tables to Markdown or HTML
But that's something else; that's no longer just OCR ("Optical Character Recognition"). If the goal suddenly changes from "can take letters in images and turn them into digital text" to "can replicate anything seen on a screen", the problem space gets too big.
For those images you have, I'd use something like Magistral + Structured Outputs instead: a first pass to figure out the right structure to parse into, and a second pass to actually fetch and structure the data.
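To illustrate the second pass, the schema could look something like this (hypothetical names and fields, not tied to any particular table; pass it as the structured-output / JSON schema of whichever client you use):

```python
from pydantic import BaseModel

class ChecklistRow(BaseModel):
    reference: str
    description: str
    applicable: bool                # tick box
    implemented: bool               # tick box
    remarks: str | None = None

class ChecklistTable(BaseModel):
    title: str
    rows: list[ChecklistRow]
```

The first pass decides which schema fits the page; the second pass fills it in, which tends to be more robust than asking for free-form Markdown.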
I threw the first image/table into Gemini 2.5 Pro, letting it choose the output format, and it looks like it extracted the data just fine. It decided to represent the checkboxes as "checked" and "unchecked" because I didn't specify preferences.
If you can accept that the machine just makes up what it doesn't recognize instead of saying "I don't know," then yes, it's solved.
(I'm not being snarky. It's acceptable in some cases.)
But this was very much the case with existing OCR software as well? In fairness, I guess the LLMs will end up making up plausible-looking text instead of text riddled with errors, which makes the mistakes much harder to catch.
Just checked it with Gemini 2.5 Flash. Instructing it to mark low-confidence words seems to work OK.
Maybe for English. Other languages are very much not solved.
Technically not OCR, but HTR (hand-written text/transcript recognition) is still difficult. LLMs have increased accuracy, but their mistakes are very hard to identify because they just 'hallucinate' text they cannot digitize.
This. I am reading old vital records in my family genealogy quest, and as those are sometimes really difficult to read, I turned to LLMs, having heard they are great at OCR. It’s been… terrible. The LLM will transcribe the record without problems, and the output seems completely correct, a typical text of a vital record. Just… the transcribed text has nothing to do with my specific record. On the other hand, transkribus.eu has been fairly usable for old vital record transcription. Even though the transcribed text is far from perfect, with many letters and words recognized incorrectly, it helps me a lot with the more difficult records.
We ran a small experiment internally on this and it looked like Gemini is better at handwriting recognition than I am. After seeing what it parsed, I was like "oh yeah, that's right". I do agree that instead of saying "Sorry, I can't read that" it just made up something.
Interesting - have you tried sending the image and 'hallucinated' text together to a review LLM to fix mistakes?
I don't have a use case where 100s or 1000s of hand-written notes have to be transcribed. I have only done this with whiteboard discussion snapshots, and it has worked really well.
VLMs suck at complex layouts, and there is a high risk of hallucination. Never use them alone for contracts or health data.
No way it's solved. Try to run OCR over a magazine with creative layouts. Not possible. I have a collection of vintage computer magazines, and from time to time I try to OCR them with the state-of-the-art mechanisms. All of them require a lot of human intervention.
Could you provide an example that fails? I'm interested in this.
Do you have an example of a particularly tricky one?
Just try old ads; you will see how hard it gets.
> My impression is that OCR is basically solved at this point.
Not really, in my experience. They especially still struggle with table format detection.
This.
Any complex parent-table / spanned-cell relationship still has low accuracy.
Try the reverse: take a picture of a complex table and ask ChatGPT 5, Claude Opus 3.1, or Gemini 2.5 Pro to produce an HTML table.
They will fail.
Maybe my imagination is limited or our documents aren't complex enough, but are we talking about realistic written documents? I'm sure you can take a screenshot of a very complex spreadsheet and it fails, but in that case you already have the data in structured form anyway, no?
Maybe I misunderstood the assignment but it seems to work for me.
https://chatgpt.com/share/68f5f9ba-d448-8005-86d2-c3fbae028b...
Edit: just caught a mistake; it transcribed one of the prices incorrectly.
Right, I wouldn't leave full table detection to a VLM, because they tend to make mistakes with the numbers in tables...
OCR of printed text may be one thing, but handwriting OCR (a.k.a HTR) is very, very far from solved. It's actually hard to find a practical task general historical HTR is good enough to do usefully, even for state of the art models.
So, the mug with inspirational text says "Bountiful Potential"?
Language support is not mentioned in the repo. But from the paper, it offers extensive multilingual support (nearly 100 languages), which is good; I still need to test it to see how it compares to Gemini and Mistral OCR.
It's DeepSeek, so one can expect an open-source license, but for anyone (like me) who wants to see that explicitly, since it's not obvious in the GitHub repo: https://huggingface.co/deepseek-ai/DeepSeek-OCR/blob/main/LI...
TLDR: It's MIT licensed
> since it's not obvious in the GitHub repo
It literally says MIT license in the right sidebar, in the README tab, and in the file called LICENSE.
Model weights are MIT too: https://huggingface.co/deepseek-ai/DeepSeek-OCR/blob/main/LI...
This looks really cool for prototyping and playing around.
It seems to me, though, that if one is building a modern application that needs to get image segmentation and/or text recognition right, there are better APIs available than natural language? It seems like a lot of effort to build a production-scale CV application only to weigh it down with all of an LLM’s shortcomings. Not a field I’m familiar with, but I would assume that this doesn’t produce state-of-the-art results; that would change the analysis.
As a hobby photographer, I organise everything for speedy retrieval, but this would be amazing for searching my collection.
Imagine you build an image segmentation model for, e.g., a specific industrial application.
With this LLM approach you can at least create your training data from the raw images using natural language.
That does make sense
It's interesting how they use "Gundam" in their variant names. I gather that Gundam-M and Gundam are their most powerful ones.
Kinda reminds me of PaddleOCR.
Would be awesome if DeepSeek OCR could be integrated into a mobile app someday. That’d make OCR way more convenient!
iOS already has both an on-device text detector and a document scanner in Apple's Vision API. Hard to say how good they are compared to LLM-based solutions. Similarly, Google has had ML Kit with on-device OCR for many years.
Google translated this to "Worry about the world first" while Bing says "Worry before the worries of the world."
Can anyone shed some light on this saying or why it's in the article?
It's a very famous (classical) Chinese phrase.
Neither translation catches the meaning well, though. It means: "worry before the rest of the world (notices that they have something to) worry about." The next part is 後天下之樂而樂 ("be happy only after the rest of the world is happy").
I don't know why it's a prompt example.
Sibling comment has the second part as
后天下之乐而乐
Which one is correct?
It depends on who you think is the rightful successor to the Qing dynasty
Traditional vs Simplified Chinese.
There are two (modern) "spellings" of written Chinese. Basically colour vs color.
Google is closer. This is from a famous essay expressing the author's desire to bear the burden for the world. The essay is 岳阳楼记 ("Memorial to Yueyang Tower") by 范仲淹 (Fan Zhongyan), written in 1046: https://zh.wikisource.org/zh-hans/%E5%B2%B3%E9%99%BD%E6%A8%9...
Ask a language model - ChatGPT says it’s a line from a famous poem “Memorial to Yueyang Tower” which expresses the Confucian ideal of selfless concern for people and society.
This clause is usually used together with the next sentence in the original poem:
> 先天下之忧而忧,后天下之乐而乐
> (put the world's worries before yours, and put your happiness after the world's)
> edit: this translation is wrong, and raincole has a definitely better translation
Since the model is a language model, they probably use this to demonstrate the model's language capabilities – the model should be able to complete the whole sentence pair. The paper also mentions this:
> To ensure the model’s language capabilities, we introduced 10% of in-house text-only pretrain data.
So I believe it is just a text-only demonstration.
Sibling comment has the second part as
後天下之樂而樂
Which one is correct?
Instead of downloading a specific OCR model how would one fare just downloading the currently best multi-modal foundation model? And what would that be at less than 30 GB?
This could be great for extracting text from old magazines; traditional OCR gives you a bit of a mess you have to clean up, but this looks like it can properly identify columns and track the flow accurately (and extract images!). It appears it can convert magazine layouts to Markdown too.
How good is this compared to most commercial OCR software?
Any vision model is better than commercial OCR software.
I'm not really sure that's an accurate summary of the state of the art; [0] is a better overview. In short: SOTA multi-modal LLMs are the best option for handwriting, nearly anything is good at printed text, and for printed media the specialty models from the hyperscalers are slightly better than multi-modal LLMs.
[0] https://research.aimultiple.com/ocr-accuracy/
I see it confirms what I wrote: the state of the art is "not using Tesseract anymore", and I think a bunch of commercial solutions are stuck with Tesseract.
I assume Tesseract has the advantage of being able to give a confidence score?
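For context, this is the sort of thing I mean (a sketch assuming pytesseract's image_to_data output; the threshold is arbitrary):

```python
import pytesseract
from PIL import Image

# Tesseract reports a per-word confidence, which the pure-LLM approaches
# discussed above don't expose directly.
data = pytesseract.image_to_data(
    Image.open("page.png"), output_type=pytesseract.Output.DICT
)

for word, conf in zip(data["text"], data["conf"]):
    if word.strip() and float(conf) < 60:   # flag low-confidence words for review
        print(f"{conf:>4}  {word}")
```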
A bit of education, please: what does it do?
Looks great, but looking at the benchmark, I can't help but think about how crazy good dots-ocr is as a model. Too bad they're not as open as the DeepSeek team, because it's so good and I would love to know how it was trained.
If you look you'd notice that it's the same Haoran Wei behind DeepSeek-OCR and GOT-OCR2.0 :p
Did we read the same graph? DeepSeek Gundam at 200 dpi appeared to get similar performance to dots-ocr, but with fewer tokens needed. The x-axis is inverted, descending with distance from the origin.