Show HN: Qwen-2.5-32B is now the best open source OCR model

jauntywundrkind

The 32b sounds like it has some useful small tweaks: tweaks to make output more human friendly, better mathematical reasoning, better fine-grained understanding. https://qwenlm.github.io/blog/qwen2.5-vl-32b/ https://news.ycombinator.com/item?id=43464068

Qwen2.5-VL-72b was released two months ago (to little fanfare in submissions, I think, but with some very enthusiastic comments, such as rabid enthusiasm for handwriting recognition) and was already very interesting. It's actually one of the releases that kind of turned me on to AI, that broke through some of my skepticism & grumpiness. There are pretty good release notes detailing capabilities here; well-done blog post. https://qwenlm.github.io/blog/qwen2.5-vl/

One thing that really piqued my interest was Qwen HTML output, where it can provide bounding boxes in HTML format for its output. That really closes the loop interestingly to me, makes the output something I can imagine quickly building useful visual feedback around, or using the structured data from easily. I can't imagine an easier to use output format.
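To make the "using the structured data" part concrete, here is a minimal sketch of pulling boxes out of that kind of HTML with only the Python standard library. It assumes each element carries a `data-bbox` attribute holding four numbers (`x1 y1 x2 y2`); that attribute name and layout are an assumption for illustration and should be checked against what the model actually emits. `BBoxCollector` and `extract_boxes` are hypothetical helper names, not part of any Qwen tooling.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


class BBoxCollector(HTMLParser):
    """Collects elements that carry a data-bbox attribute (assumed format)."""

    def __init__(self):
        super().__init__()
        self.boxes = []   # one record per element that has a valid bbox
        self._open = []   # stack of (tag, record-or-None) for open elements

    def handle_starttag(self, tag, attrs):
        raw = dict(attrs).get("data-bbox")
        record = None
        if raw:
            try:
                # Accept "x1 y1 x2 y2" or "x1,y1,x2,y2".
                coords = tuple(int(float(v)) for v in raw.replace(",", " ").split())
            except ValueError:
                coords = ()
            if len(coords) == 4:
                record = {"tag": tag, "bbox": coords, "text": ""}
                self.boxes.append(record)
        if tag not in VOID_TAGS:
            self._open.append((tag, record))

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, plus anything left
        # unclosed inside it.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i][0] == tag:
                del self._open[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Attach text to the innermost open element that has a bbox.
        for _, record in reversed(self._open):
            if record is not None:
                record["text"] = (record["text"] + " " + text).strip()
                break


def extract_boxes(html):
    """Return a list of {"tag", "bbox", "text"} dicts found in the HTML."""
    parser = BBoxCollector()
    parser.feed(html)
    parser.close()
    return parser.boxes


# Made-up sample input, not real model output.
sample = (
    '<div data-bbox="10 20 200 60">'
    '<p data-bbox="12 22 198 40">Invoice 42</p>'
    '<p>Total due</p>'
    '</div>'
)

for box in extract_boxes(sample):
    print(box["tag"], box["bbox"], repr(box["text"]))
```

From there, drawing the rectangles over the source image for a quick visual check is a small step, since each record already pairs the text with its coordinates.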

ks2048

I suppose none of these models can output bounding box coordinates for extracted text? That seems to be a big advantage of traditional OCR over LLMs.

For applications I'm interested in, until we can get to 95+% accuracy, it will require human double-checking / corrections, which seems unfeasible w/o bounding boxes to quickly check for errors.

pmarreck

Downloading the MLX version of "Qwen2.5-VL-32b-Instruct -8bit" via LM Studio right now since it's not yet available on Ollama and I can run it locally... I have an OCR side project for it to work on, want to see how performant it is on my M4... will report back

daemonologist

You mention that you measured cost and latency in addition to accuracy - would you be willing to share those results as well? (I understand that for these open models they would vary between providers, but it would be useful to have an approximate baseline.)

fpgaminer

I've been consistently surprised by Gemini's OCR capabilities. And yeah, Qwen is climbing the vision ladder _fast_.

In my workflows I often have multiple models competing side-by-side, so I get to compare the same task executed on, say, 4o, Gemini, and Qwen. And I deal with a very wide range of vision related tasks. The newest Qwen models are not only overall better than their previous release by a good margin, but also much more stable (less prone to glitching) and easier to finetune. I'm not at all surprised they're topping the OCR benchmark.

What bugs me though is OpenAI. Outside of OCR, 4o is still king in terms of overall understanding of images. But 4o is now almost a year old, and in all that time they have neither improved the vision performance in any newer releases, nor have they improved OCR. OpenAI's OCR has been bad for a long time, and it's both odd and annoying.

Taken with a grain of salt since again I've only had it in my workflow for about a week or two, but I'd say Qwen 2.5 VL 72b beats Gemini for general vision. That lands it in second place for me. And it can be run _locally_. That's nuts. I'm going to laugh if Qwen drops another iteration in a couple months that beats 4o.

ks2048

I've been doing some experiments with the OCR API on macOS lately and wonder how it compares to these LLMs.

Overall, it's very impressive, but makes some mistakes (on easy images - i.e. obviously wrong) that require human intervention.

I would like to compare it to these models, but this benchmark goes beyond OCR: it evaluates extracted structured JSON.

AndrewDucker

Tesseract can manage 99% accuracy on anything other than handwriting. Without being an LLM.

Is there an advantage of using an LLM here?

CSMastermind

I've been very impressed with Qwen in my testing; I think people are underestimating it.

WillAdams

How does one configure an LLM interface using this to process multiple files with a single prompt?

codybontecou

Nice work Tyler and team!

ianhawes

Is there a reason Surya isn’t included?

sandreas

What about MiniCPM-V 2.6?

azinman2

News update: OCR company touts new benchmark that shows its own products are the most performant.
