Qwen2.5-VL-72b was released two months ago (to little fanfare in submissions, i think, but some very enthusiastic comments such as rabid enthusiasm for handwriting recognition) already very interesting. Its actually one of the releases that kind of turned me on to AI, that broke through some of my skepticism & grumpiness. There's pretty good release notes detailing capabilities here; well done blog post. https://qwenlm.github.io/blog/qwen2.5-vl/
One thing that really piqued my interest was Qwen HTML output, where it can provide bounding boxes in HTML format for its output. That really closes the loop interestingly to me, makes the output something I can imagine quickly building useful visual feedback around, or using the structured data from easily. I can't imagine an easier to use output format.
ks2048
I suppose none of these models can output bounding box coordinates for extracted text? That seems to be a big advantage of traditional OCR over LLMs.
For applications I'm interested in, until we can get to 95+% accuracy, it will require human double-checking / corrections, which seems unfeasible w/o bounding boxes to quickly check for errors.
show comments
pmarreck
Downloading the MLX version of "Qwen2.5-VL-32b-Instruct -8bit" via LM Studio right now since it's not yet available on Ollama and I can run it locally... I have an OCR side project for it to work on, want to see how performant it is on my M4... will report back
show comments
daemonologist
You mention that you measured cost and latency in addition to accuracy - would you be willing to share those results as well? (I understand that for these open models they would vary between providers, but it would be useful to have an approximate baseline.)
show comments
fpgaminer
I've been consistently surprised by Gemini's OCR capabilities. And yeah, Qwen is climbing the vision ladder _fast_.
In my workflows I often have multiple models competing side-by-side, so I get to compare the same task executed on, say, 4o, Gemini, and Qwen. And I deal with a very wide range of vision related tasks. The newest Qwen models are not only overall better than their previous release by a good margin, but also much more stable (less prone to glitching) and easier to finetune. I'm not at all surprised they're topping the OCR benchmark.
What bugs me though is OpenAI. Outside of OCR, 4o is still king in terms of overall understanding of images. But 4o is now almost a year old, and in all that time they have neither improved the vision performance in any newer releases, nor have they improved OCR. OpenAI's OCR has been bad for a long time, and it's both odd and annoying.
Taken with a grain of salt since again I've only had it in my workflow for about a week or two, but I'd say Qwen 2.5 VL 72b beats Gemini for general vision. That lands it in second place for me. And it can be run _locally_. That's nuts. I'm going to laugh if Qwen drops another iteration in a couple months that beats 4o.
ks2048
I've been doing some experiments with the OCR API on macOS lately and wonder how it compares to these LLMs.
Overall, it's very impressive, but makes some mistakes (on easy images - i.e. obviously wrong) that require human intervention.
I would like to compare it to these models, but this benchmark is beyond OCR - extracted structured JSON.
AndrewDucker
Tesseract can manage 99% accuracy on anything other than handwriting. Without being an LLM.
Is there an advantage of using an LLM here?
show comments
CSMastermind
I've been very impressed with Qwen in my testing, I think people are underestimating it
show comments
WillAdams
How does one configure an LLM interface using this to process multiple files with a single prompt?
show comments
codybontecou
Nice work Tyler and team!
ianhawes
Is there a reason Surya isn’t included?
sandreas
What about mini cpm v2.6?
azinman2
News update: OCR company touts new benchmark that shows its own products are the most performant.
The 32b sounds like it has some useful small tweakers. Tweaks to make output more human friendly, better mathematical reasoning, better fine-grained understanding. https://qwenlm.github.io/blog/qwen2.5-vl-32b/ https://news.ycombinator.com/item?id=43464068
Qwen2.5-VL-72b was released two months ago (to little fanfare in submissions, i think, but some very enthusiastic comments such as rabid enthusiasm for handwriting recognition) already very interesting. Its actually one of the releases that kind of turned me on to AI, that broke through some of my skepticism & grumpiness. There's pretty good release notes detailing capabilities here; well done blog post. https://qwenlm.github.io/blog/qwen2.5-vl/
One thing that really piqued my interest was Qwen HTML output, where it can provide bounding boxes in HTML format for its output. That really closes the loop interestingly to me, makes the output something I can imagine quickly building useful visual feedback around, or using the structured data from easily. I can't imagine an easier to use output format.
I suppose none of these models can output bounding box coordinates for extracted text? That seems to be a big advantage of traditional OCR over LLMs.
For applications I'm interested in, until we can get to 95+% accuracy, it will require human double-checking / corrections, which seems unfeasible w/o bounding boxes to quickly check for errors.
Downloading the MLX version of "Qwen2.5-VL-32b-Instruct -8bit" via LM Studio right now since it's not yet available on Ollama and I can run it locally... I have an OCR side project for it to work on, want to see how performant it is on my M4... will report back
You mention that you measured cost and latency in addition to accuracy - would you be willing to share those results as well? (I understand that for these open models they would vary between providers, but it would be useful to have an approximate baseline.)
I've been consistently surprised by Gemini's OCR capabilities. And yeah, Qwen is climbing the vision ladder _fast_.
In my workflows I often have multiple models competing side-by-side, so I get to compare the same task executed on, say, 4o, Gemini, and Qwen. And I deal with a very wide range of vision related tasks. The newest Qwen models are not only overall better than their previous release by a good margin, but also much more stable (less prone to glitching) and easier to finetune. I'm not at all surprised they're topping the OCR benchmark.
What bugs me though is OpenAI. Outside of OCR, 4o is still king in terms of overall understanding of images. But 4o is now almost a year old, and in all that time they have neither improved the vision performance in any newer releases, nor have they improved OCR. OpenAI's OCR has been bad for a long time, and it's both odd and annoying.
Taken with a grain of salt since again I've only had it in my workflow for about a week or two, but I'd say Qwen 2.5 VL 72b beats Gemini for general vision. That lands it in second place for me. And it can be run _locally_. That's nuts. I'm going to laugh if Qwen drops another iteration in a couple months that beats 4o.
I've been doing some experiments with the OCR API on macOS lately and wonder how it compares to these LLMs.
Overall, it's very impressive, but makes some mistakes (on easy images - i.e. obviously wrong) that require human intervention.
I would like to compare it to these models, but this benchmark is beyond OCR - extracted structured JSON.
Tesseract can manage 99% accuracy on anything other than handwriting. Without being an LLM.
Is there an advantage of using an LLM here?
I've been very impressed with Qwen in my testing, I think people are underestimating it
How does one configure an LLM interface using this to process multiple files with a single prompt?
Nice work Tyler and team!
Is there a reason Surya isn’t included?
What about mini cpm v2.6?
News update: OCR company touts new benchmark that shows its own products are the most performant.