The paper is more interesting than just another VLM for OCR; they get into compression and related ideas. E.g. there is this quote:
>Our work represents an initial exploration into the boundaries of vision-text compression, investigating how many vision tokens are required to decode text tokens. The preliminary results are encouraging: DeepSeek-OCR achieves near-lossless OCR compression at approximately 10× ratios, while 20× compression still retains 60% accuracy.
(I guess you could say a picture token is worth 10 textual tokens...)
Could someone explain to a noob what the information-theoretic intuition is here? Why does this work? Is it that text tokens are still too "granular"/repetitive and don't come close to ideal entropy coding? Or is switching to vision tokens escaping the limitation of working "one word-ish at a time", allowing you to get closer to the entropy (similar to the way arithmetic coding does compared to Huffman codes)?
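To make the intuition concrete, here is a back-of-envelope capacity comparison. Every constant below (vocab size, embedding width, precision) is my own assumption rather than a figure from the paper: a text token is a discrete symbol carrying at most log2(vocab) bits, while a vision token is a continuous embedding whose raw channel capacity is much larger, so asking one vision token to stand in for ~10 fairly low-entropy text tokens is not information-theoretically surprising; the open question is how much of that capacity the learned encoder/decoder actually exploits.

```python
import math

# Back-of-envelope only; every constant here is an assumption, not from the paper.
vocab_size = 128_000                          # hypothetical tokenizer vocabulary
bits_per_text_token = math.log2(vocab_size)   # ~17 bits if tokens were uniformly distributed
# Real text has much lower entropy per token in practice,
# so 10 text tokens carry roughly 50-170 bits of information.

embed_dim, bits_per_dim = 1024, 16            # hypothetical vision-token embedding (fp16)
vision_token_capacity = embed_dim * bits_per_dim   # raw channel capacity, not usable information

print(f"one text token   <= {bits_per_text_token:.1f} bits (uniform upper bound)")
print(f"10 text tokens   <= {10 * bits_per_text_token:.0f} bits to recover from one vision token")
print(f"one vision token ~  {vision_token_capacity} bits of raw capacity")
```

In that framing, the 10x "near-lossless" point is less about beating entropy coding and more about how reliably a decoder can read text back out of a dense continuous representation.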
And then they start talking about handling long-context by literally(?) downscaling images, forming a correspondence between information loss in the textual domain and the image domain.
simonw
I figured out how to get this running on the NVIDIA Spark (ARM64, which makes PyTorch a little bit trickier than usual) by running Claude Code as root in a new Docker container and having it figure it out. Notes here: https://simonwillison.net/2025/Oct/20/deepseek-ocr-claude-co...
Here's a result I got https://github.com/simonw/research/blob/main/deepseek-ocr-nv... - against this image: https://static.simonwillison.net/static/2025/ft.jpeg
The paper makes no mention of Anna’s Archive. I wouldn’t be surprised if DeepSeek took advantage of Anna’s offer granting OCR researchers access to their 7.5 million (350 TB) Chinese non-fiction collection ... which is bigger than Library Genesis.
https://annas-archive.org/blog/duxiu-exclusive.html
I haven't fired this up yet, but I've been evaluating and working with quite a few different VLMs, from the small Granite and Qwen models up to the larger ones available, to see if we can fully replace traditional OCR in our system, and I've been disappointed so far. Our system takes documents from customers and supplies them back as normalized documents (i.e. rasterized multi-page bitmaps) marked up as they've requested. In our use case, though, we need accurate coordinates down to the letter/word level, and in my experience the positional outputs from these VLMs are either wildly inconsistent, completely hallucinated, or so vague that they don't let us target anything with any accuracy or granularity.
Our solution so far has been to stick with Tesseract plus good clean-up routines, then augment/fix up the output using the VLM OCR text where we don't have structured source document data available.
It could be that we just have a very niche use case that doesn't matter to most people. I'm sure these VLMs work well if you just want a text dump or a restructured Markdown/HTML representation of a document, but the number of articles and comments I've seen claiming that these models have 'solved' OCR just seems counter to our experience.
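For contrast with the VLM outputs, here is a minimal sketch of the word-level boxes and confidences Tesseract provides via pytesseract (the file name is a placeholder); this is the kind of structured, per-word grounding the comment above relies on:

```python
import pytesseract
from PIL import Image

# Word-level text, bounding boxes and confidences from Tesseract.
img = Image.open("page.png")  # placeholder input raster
data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)

words = [
    {"text": txt, "box": (left, top, width, height), "conf": float(conf)}
    for txt, left, top, width, height, conf in zip(
        data["text"], data["left"], data["top"],
        data["width"], data["height"], data["conf"],
    )
    if txt.strip()  # skip the empty cells Tesseract emits for layout structure
]
print(words[:5])
```

Each entry keeps the per-word box and Tesseract's confidence score, which is the granularity the markup/targeting workflow described above needs.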
pietz
My impression is that OCR is basically solved at this point.
The OmniAI benchmark that's also referenced here hasn't been updated with new models since February 2025. I assume that's because general-purpose LLMs have gotten better at OCR than their own OCR product.
I've been able to solve a broad range of OCR tasks by simply sending each page as an image to Gemini 2.5 Flash Lite and asking it nicely to extract the content in Markdown under some additional formatting instructions. That will cost you around $0.20 for 1000 pages in batch mode and the results have been great.
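For anyone who wants to try that approach, here is a rough sketch using the google-genai Python SDK; the model name, prompt wording, and file name are assumptions on my part, not details from the comment:

```python
from google import genai
from google.genai import types

# Sketch only: model name, prompt wording and file name are assumptions.
client = genai.Client()  # picks up the API key from the environment

with open("page-001.png", "rb") as f:
    page = types.Part.from_bytes(data=f.read(), mime_type="image/png")

prompt = (
    "Extract the full content of this page as Markdown. "
    "Preserve headings, lists and tables; do not add commentary."
)
resp = client.models.generate_content(
    model="gemini-2.5-flash-lite",
    contents=[page, prompt],
)
print(resp.text)
```

The ~$0.20 per 1,000 pages figure quoted above assumes batch mode, which goes through a separate batch submission flow rather than this synchronous call.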
I'd be interested to hear where OCR still struggles today.
rsp1984
Can someone ELI5 to me (someone who doesn't have the time to keep up with all the latest research) what this is and why it's a big deal?
It's very hard to guess from the GitHub repo and the paper. For example, there is OCR in the title, but the abstract and readme.md talk about context compression for LLMs, which I find confusing. Would somebody care to explain the link and provide some high-level context?
Is this only at the level of visual compression? For example, are there any applications in terms of understanding (being able to represent the actual meaning it stands for) and reasoning? Technically, it seems to have no connection with current reinforcement learning and other techniques. The model is quite small, yet there appears to be no explanation regarding its understanding capabilities. If it is merely for compression, what impact will it have on the current large models?
modeless
Hmm, at first I was thinking "why OCR?", but maybe the reason is to ingest more types of training data for LLM improvement, e.g. scanned academic papers? I imagine all the frontier labs have a solution for this due to the value of academic papers as a data source.
Edit: Oh I see the paper abstract says this explicitly: "In production, DeepSeek-OCR can generate training data for LLMs/VLMs at a scale of 200k+ pages per day (a single A100-40G)". This is just part of the training data ingestion pipeline for their real models. Explains why the architecture is not using all of their latest tricks: it's already good enough for their use case and it's not the main focus.
edtechdev
I tried this out on huggingface, and it has the same issue as every other multimodal AI OCR option (including MinerU, olmOCR, Gemini, ChatGPT, ...). It ignores pictures, charts, and other visual elements in a document, even though the models are pretty good at describing images and charts by themselves. What this means is that you can't use these tools yet to create fully accessible alternatives to PDFs.
allanren
It says the conversation context can be reduced in size with heavy compression, which basically makes the image blurry but still retains the important information.
This is indeed amazing. It's actually how humans try to understand and remember things: visually! And when memories fade, the image gets blurrier.
Not sure if the closed-source multimodal models are already using this method.
schopra909
It’s not clear to me what the bottleneck is for OCR to work “100%” with LLMs.
In my work we do a lot of stuff with image understanding and captioning (not OCR). There, object identification and description work great, since all the models use a CLIP-like visual backbone. But it falls apart when you ask about nuances like left/right or counting (reasoning kind of improves the latter, but it’s too expensive to matter IMO).
For our tasks, it’s clear that more fundamental research needs to be done on vision understanding to push past CLIP. That would really improve LLMs for our use cases.
Curious if there’s something similar going on for OCR in the vision encoder that’s fundamentally holding it back.
neves
I see that the project uses conda for development. Is it still a good tool now that pip also installs binaries?
Are there any 'small' OCR models around? Say I only care about reading serial numbers from photos in a manufacturing process, not whole document parsing. Using a 3B param model to do this seems like a bit of overkill...
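A lightweight sketch of the non-VLM route for that kind of task: plain Tesseract with single-line page segmentation and a character whitelist. The whitelist alphabet and file name are assumptions about the serial format, and whitelist behaviour can vary a bit across Tesseract versions/engines:

```python
import pytesseract
from PIL import Image

# Single text line (--psm 7) restricted to a serial-number alphabet (assumed format).
img = Image.open("unit_photo.png")  # placeholder: cropped photo of the serial plate
config = r"--psm 7 -c tessedit_char_whitelist=ABCDEFGHJKLMNPQRSTUVWXYZ0123456789-"
serial = pytesseract.image_to_string(img, config=config).strip()
print(serial)
```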
prats226
The top 3 models on Hugging Face are all OCR models. Most automation projects involve documents where you need a model finetuned to understand all the elements inside a document and to provide grounding, confidence scores, etc., which is why this subset of models is gaining popularity.
2big2fail_47
I find it interesting that there are all these independent AI-OCR projects but still no commercial offering. Is it still too inaccurate, too complex, or simply too expensive?
piker
This looks really cool for prototyping and playing around.
It seems to me, though, that if one is building a modern application that needs to get image segmentation and/or text recognition right, there are better APIs available than natural language? It seems like a lot of effort to build a production-scale CV application only to weigh it down with all of an LLM’s shortcomings. Not a field I’m familiar with, but I would assume this doesn’t produce state-of-the-art results; that would change the analysis.
The only model I have found so far that extracts table data with OCR is dots.ocr. Models that came after it have not done a good job. Interested in testing this new model.
ammar_x
Language support is not mentioned in the repo.
But from the paper, it offers extensive multilingual support (nearly 100 languages), which is good; I still need to test it to see how it compares to Gemini and Mistral OCR.
x______________
>先天下之忧而忧
How is this an example of a prompt?
Google translated this to "Worry about the world first" while Bing says "Worry before the worries of the world."
Can anyone shed some light on this saying or why it's in the article?
singularity2001
Instead of downloading a specific OCR model how would one fare just downloading the currently best multi-modal foundation model? And what would that be at less than 30 GB?
loaderchips
Great work, guys. How about we replace the global encoder with a Mamba (state-space) vision backbone to eliminate the O(n²) attention bottleneck, enabling linear-complexity encoding of high-resolution documents? Pair this with a non-autoregressive (non-AR) decoder, such as Mask-Predict or iterative refinement, that generates all output tokens in parallel instead of sequentially. Together this creates a fully parallelizable vision-to-text pipeline; the combination addresses both major bottlenecks in DeepSeek-OCR.
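For context, the non-AR half of that proposal is an existing technique (Mask-Predict style iterative refinement). A rough, heavily simplified sketch of the decode loop, where `decoder` is a hypothetical parallel vision-to-text decoder rather than anything in DeepSeek-OCR:

```python
import torch

def mask_predict_decode(decoder, vision_feats, seq_len, mask_id, n_iters=4):
    """Parallel decoding sketch: predict every position at once, then iteratively
    re-mask and re-predict the least confident positions."""
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for t in range(n_iters):
        logits = decoder(vision_feats, tokens)        # (1, seq_len, vocab), one parallel pass
        scores, preds = logits.softmax(-1).max(-1)    # per-position confidence and argmax token
        tokens = preds
        n_mask = int(seq_len * (1 - (t + 1) / n_iters))  # linearly shrinking mask budget
        if n_mask == 0:
            break
        worst = scores.topk(n_mask, largest=False).indices
        tokens[0, worst[0]] = mask_id                 # re-mask low-confidence positions
    return tokens
```

Whether parallel decoding can match autoregressive quality on dense documents is an open question; the sketch is only meant to make the "all tokens at once" idea concrete.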
hank2000
Have y'all seen tensorlake? I’m curious how this compares to a model custom-built for the problem. My guess is it can be as good. But can it be as efficient?
Disclaimer: I do not work for tensorlake, but I know the folks behind it.
empressplay
This could be great for extracting text from old magazines; traditional OCR gives you a bit of a mess you have to clean up, but this looks like it can properly identify columns and track the flow accurately (and extract images!). It appears it can convert magazine layouts to Markdown too.
farseer
How good is this compared to most commercial OCR software?
brightUiso
A bit of education, please: what does it do?
bugglebeetle
Looks great, but looking at the benchmark, I can’t help but think about how crazy good dots.ocr is as a model. Too bad they’re not as open as the DeepSeek team, because it’s so good and I would love to know how it was trained.
tinyhouse
OCR is not a great name for these models. While they can do traditional OCR, such as digitizing a scanned PDF, they do so much more.
joshstrange
> [2025/x/x] We release DeepSeek-OCR, a model to investigate the role of vision encoders from an LLM-centric viewpoint.
So close but it should be 2025/X/XX as "X" = 10 in Roman Numerals /s
Jokes aside, this is really neat and I'm looking forward to getting this running. For most OCR-type stuff I just use AWS Textract since I need it so rarely and that service does a decent job. I really like how well this model seems to extract images/figures as well from the original document.
For everyone wondering how good this and other benchmarks are:
- the OmniAI benchmark is bad
- Instead check OmniDocBench[1] out
- Mistral OCR is far, far behind most open-source OCR models and even further behind Gemini
- End to End OCR is still extremely tricky
- composed pipelines work better (layout detection -> reading order -> OCR every element)
- complex table parsing is still extremely difficult
[1]: https://github.com/opendatalab/OmniDocBench
How does an LLM approach to OCR compare to say Azure AI Document Intelligence (https://learn.microsoft.com/en-us/azure/ai-services/document...) or Google's Vision API (https://cloud.google.com/vision?hl=en)?
How does it compare to Tesseract? https://github.com/tesseract-ocr/tesseract
I use ocrmypdf (which uses Tesseract). Runs locally and is absolutely fantastic. https://ocrmypdf.readthedocs.io/en/latest/
How does this compare to https://huggingface.co/ibm-granite/granite-docling-258M, in performance and in how they work?
It's DeepSeek, so one can expect an open-source license, but for anyone (like me) who wants to see it explicitly, since it's not obvious in the GitHub repo: https://huggingface.co/deepseek-ai/DeepSeek-OCR/blob/main/LI...
TL;DR: it's MIT licensed.
It's interesting how they use "Gundam" in their variant names. I gather that Gundam-M and Gundam are their most powerful ones.
Kinda reminds me of PaddleOCR.
Would be awesome if DeepSeek OCR could be integrated into a mobile app someday. That’d make OCR way more convenient!
How does this fare on the Vidore benchmark?
https://huggingface.co/spaces/vidore/vidore-leaderboard