Teams building document automation often hit the same wall: you can’t reliably extract what you can’t see. Many OCR systems either struggle on messy layouts or produce text without reliable grounding, controllable region reads, or easy human verification.
To bridge that gap, we built GutenOCR: a grounded OCR front-end created by fine-tuning Qwen2.5-VL (3B and 7B) into a single model that supports full-page reading, text detection, localized reading, and “where is X?” queries.
On 10.5K held-out business and scientific pages, GutenOCR-7B more than doubled the composite grounded OCR score of its backbone (0.40 → 0.82), with major gains in detection and region-level reading.
In real workflows (claims, invoices, scientific PDFs, forms, and reports), OCR isn’t a one-shot “convert page to text” step. It’s the front-end for everything downstream: extraction, retrieval, QA, and human review. When OCR misses text regions, scrambles reading order, or can’t explain where a string came from, the rest of the pipeline inherits those failures.
Two families of approaches dominate today: traditional OCR engines, which are fast and predictable but struggle on messy layouts, and general vision–language models, which read diverse documents but emit text without reliable grounding or controllable region reads.
What production teams need is a grounded OCR front-end: something that produces text and reliably attaches it to pixels, with the ability to read arbitrary regions on demand.
GutenOCR is built to feel familiar if you’ve used traditional OCR while bringing the adaptability of modern vision–language models. Instead of stitching together multiple components, a single GutenOCR model can handle the most common “front-door” OCR needs through simple prompts: full-page reading, text detection, localized (region-level) reading, and “where is X?” grounding queries.
Why does this matter? Because it gives teams a flexible, reliable OCR interface that can plug into many workflows. You can index content, extract key fields, highlight evidence for reviewers, or re-read just the part of a document that matters without changing tools or forcing downstream systems to adapt to a one-size-fits-all output format.
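To make the single-model, multi-prompt interface concrete, here is a minimal Python sketch of a prompt builder for the four front-door tasks. The task names and prompt wording are illustrative assumptions for this post, not GutenOCR’s actual prompt format:

```python
# Hypothetical prompt builder for a grounded OCR model like GutenOCR.
# Task names and prompt wording are illustrative assumptions, not the
# model's actual interface.

def build_prompt(task: str, region=None, query=None) -> str:
    """Return a text prompt for one of the four front-door OCR tasks."""
    if task == "read_page":
        return "Read all text on this page in order."
    if task == "detect":
        return "Detect every text region and return its bounding box."
    if task == "read_region":
        if region is None:
            raise ValueError("read_region requires a bounding box")
        x1, y1, x2, y2 = region
        return f"Read the text inside the box ({x1}, {y1}, {x2}, {y2})."
    if task == "locate":
        if query is None:
            raise ValueError("locate requires a query string")
        return f'Where is "{query}" on this page?'
    raise ValueError(f"unknown task: {task}")
```

The point of the sketch is the contract, not the strings: one model, four predictable behaviors, selected by prompt rather than by swapping components.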
We fine-tuned Qwen2.5-VL-3B and Qwen2.5-VL-7B into OCR-specialized models without changing tokenizers, freezing modules, or adding adapters. The goal wasn’t to create the “best Markdown generator.” It was to teach a general-purpose VLM to become a reliable OCR primitive layer.
The training mixture combines several data sources and training signals, and together they push the model toward the behaviors production systems need most: high text-region recall, stable structured outputs, and controllable re-reading.
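“Stable structured outputs” matter because downstream code has to parse them. As an illustration, here is a validator for one grounded-OCR record; the JSON-lines schema (`{"text": ..., "bbox": [x1, y1, x2, y2]}`) is an assumed format for this sketch, not GutenOCR’s actual output:

```python
import json

# Validate one model output line against a simple grounded-OCR schema.
# The JSON-lines format ({"text": ..., "bbox": [x1, y1, x2, y2]}) is an
# assumed schema for illustration, not GutenOCR's actual output format.

def parse_grounded_line(line: str):
    """Parse and sanity-check a single text-region record."""
    rec = json.loads(line)
    text, bbox = rec["text"], rec["bbox"]
    if len(bbox) != 4:
        raise ValueError(f"expected 4 coordinates, got {len(bbox)}")
    x1, y1, x2, y2 = bbox
    if not (x1 < x2 and y1 < y2):
        raise ValueError(f"degenerate box: {bbox}")
    return text, (x1, y1, x2, y2)
```

A strict parser like this is what lets a pipeline reject a malformed page early instead of silently propagating an ungrounded string downstream.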
We tested GutenOCR on three fronts: held-out business and scientific pages, plus two public benchmarks (Fox and OmniDocBench v1.5). The takeaway is simple: grounding got much better, and the remaining gaps are clear and actionable.
On 10.5K held-out pages, GutenOCR was consistently better at the things production workflows depend on: finding text, re-reading specific regions accurately, and locating phrases on demand (“where is X?”).
Fox is a stress test for interactive OCR. GutenOCR made big strides on region- and line-level reading, which is exactly the behavior we set out to improve. The trade-offs showed up in page-level ordering (it follows layout cues that don’t always match Fox’s expected sequence) and color-guided prompts, where performance drops because those cues weren’t a focus in training.
OmniDocBench is broad and visually diverse, and it helped validate that GutenOCR’s detection skills generalize. We saw a major boost in text detection coverage, but also clear weaknesses on formula-heavy pages and busy, multi-colored backgrounds – areas we’re targeting next with more specialized data and training signals.
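For readers who want to reproduce a “text detection coverage” number on their own pages, the standard recipe is IoU-matched recall. This is a generic sketch with a conventional 0.5 threshold; the benchmarks above may use different matching protocols:

```python
# IoU-based detection recall: a standard way to score text detection
# coverage. The 0.5 threshold and greedy matching are conventional
# choices for illustration; benchmark protocols may differ.

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def detection_recall(gt_boxes, pred_boxes, thresh=0.5):
    """Fraction of ground-truth regions matched by some prediction."""
    hits = sum(
        1 for g in gt_boxes if any(iou(g, p) >= thresh for p in pred_boxes)
    )
    return hits / len(gt_boxes) if gt_boxes else 1.0
```

Recall is the metric to watch for a front-end: a missed region is text the rest of the pipeline will never see.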
Grounding changes the economics of review and iteration.
In practice, this is the difference between “an OCR transcript you hope is right” and a verifiable, evidence-bearing front-end that supports active learning and exception workflows.
GutenOCR is our step toward an OCR layer that’s modern in capability but classical in contract: predictable primitives, stable interfaces, and explicit grounding. The results show that a general-purpose vision–language model can be trained into a strong grounded OCR front-end – dramatically improving detection and localized reading – while also surfacing the trade-offs that matter (reading-order assumptions, color cues, formula-heavy layouts).
Next, we’re focused on extending the recipe to be more math-aware, more color-robust, and more structure-aware (especially for tables), without sacrificing the grounding and controllability that make GutenOCR useful in production.
If your document workflows depend on trustworthy text, targeted rereads, and easy human verification, grounded OCR isn’t a “nice to have”; it’s the foundation.