Teams building document automation often hit the same wall: you can’t reliably extract what you can’t see. Many OCR systems either struggle on messy layouts or produce text without reliable grounding, controllable region reads, or easy human verification.
To bridge that gap, we built GutenOCR: a grounded OCR front-end created by fine-tuning Qwen2.5-VL (3B and 7B) into a single model that supports full-page reading, text detection, localized reading, and “where is X?” queries.
On 10.5K held-out business and scientific pages, GutenOCR-7B more than doubled the composite grounded OCR score of its backbone (0.40 → 0.82), with major gains in detection and region-level reading.

The Problem with “Just OCR” in Production
In real workflows (claims, invoices, scientific PDFs, forms, and reports), OCR isn’t a one-shot “convert page to text” step. It’s the front-end for everything downstream: extraction, retrieval, QA, and human review. When OCR misses text regions, scrambles reading order, or can’t explain where a string came from, the rest of the pipeline inherits those failures.
Two families of approaches dominate today:
- Classical OCR pipelines provide boxes and control but can be brittle on complex layouts and noisy scans.
- Vision–language models can read broadly but often lack stable grounding, making review and targeted re-reading harder.
What production teams need is a grounded OCR front-end: something that produces text and reliably attaches it to pixels, with the ability to read arbitrary regions on demand.
Introducing GutenOCR: A Unified Grounded Front-End
GutenOCR is built to feel familiar to anyone who has used traditional OCR, while bringing the adaptability of modern vision–language models. Instead of stitching together multiple components, a single GutenOCR model handles the most common “front-door” OCR needs through simple prompts (see the sketch after this list):
- Read the whole page (plain text or layout-aware text)
- Find text on the page (return locations of lines/blocks)
- Search for specific content (“where is X?”)
- Read a specific region (re-read only what’s highlighted)
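To make the prompt-driven interface concrete, here is a minimal sketch of what calling one checkpoint for all four behaviors could look like. It assumes the standard Hugging Face Qwen2.5-VL classes; the model ID and the exact prompt wordings are illustrative assumptions, not the published interface.

```python
# Minimal sketch: one model, four behaviors, selected purely by prompt.
# MODEL_ID and the prompt strings below are illustrative assumptions.
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

MODEL_ID = "your-org/GutenOCR-7B"  # hypothetical checkpoint name
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(MODEL_ID, device_map="auto")

def ask(image_path: str, prompt: str) -> str:
    """Send one page image plus one instruction, return the decoded answer."""
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt},
        ],
    }]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], images=[Image.open(image_path)], return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=2048)
    # Strip the prompt tokens, keep only the newly generated answer.
    return processor.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]

page_text = ask("invoice.png", "Read all text on the page.")
boxes     = ask("invoice.png", "Detect all text lines and return their bounding boxes.")
hit       = ask("invoice.png", "Where is 'Total Due' on the page?")
region    = ask("invoice.png", "Read the text inside the region [120, 80, 560, 140].")
```

The point is the contract: one model, one call signature, and the behavior selected entirely by the instruction.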
Why does this matter? Because it gives teams a flexible, reliable OCR interface that can plug into many workflows. You can index content, extract key fields, highlight evidence for reviewers, or re-read just the part of a document that matters without changing tools or forcing downstream systems to adapt to a one-size-fits-all output format.

How We Trained GutenOCR (and Why It Works)
We fine-tuned Qwen2.5-VL-3B and Qwen2.5-VL-7B into OCR-specialized models without changing tokenizers, freezing modules, or adding adapters. The goal wasn’t to create the “best Markdown generator.” It was to teach a general-purpose VLM to become a reliable OCR primitive layer.
The training mixture combines:
- Real business documents (noisy forms, IDs, invoices, claims)
- Scientific articles (multi-column layouts, long contexts, symbols, captions)
- Synthetic grounding data targeted at line geometry and math region supervision
We also used:
- Prompt variation (so behaviors are robust to phrasing differences)
- A length-based curriculum, gradually extending context to support long page transcripts (up to ~8K–16K tokens depending on stage; sketched below)
This combination pushes the model toward the behaviors production systems need most: high text-region recall, stable structured outputs, and controllable re-reading.
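As an illustration of the curriculum idea, here is a minimal sketch: bucket training pages by transcript length in tokens and admit longer buckets stage by stage. The stage boundaries and field names are assumptions made for the sketch, not our exact training configuration.

```python
# Minimal sketch of a length-based curriculum: each stage re-includes shorter
# pages while raising the cap on transcript length. Stage caps are illustrative.
from typing import Iterable

STAGE_MAX_TOKENS = [2048, 4096, 8192, 16384]  # hypothetical per-stage caps

def curriculum_stages(samples: list[dict], tokenizer) -> Iterable[list[dict]]:
    """Yield one training subset per stage, ordered from short to long contexts."""
    lengths = [len(tokenizer(s["transcript"]).input_ids) for s in samples]
    for cap in STAGE_MAX_TOKENS:
        # Shorter pages stay in every stage so earlier skills aren't forgotten.
        yield [s for s, n in zip(samples, lengths) if n <= cap]
```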
What We Saw in Evaluation: Big Grounding Gains, Clear Trade-Offs
We tested GutenOCR on three fronts: held-out business and scientific pages, plus two public benchmarks (Fox and OmniDocBench v1.5). The takeaway is simple: grounding got much better, and the remaining gaps are clear and actionable.
In-domain: business + scientific documents
On 10.5K held-out pages, GutenOCR was consistently better at the things production workflows depend on: finding text, re-reading specific regions accurately, and locating phrases on demand (“where is X?”).
Fox: fine-grained “focus anywhere” OCR
Fox is a stress test for interactive OCR. GutenOCR made big strides on region- and line-level reading, which is exactly the behavior we set out to improve. The trade-offs showed up in page-level ordering (it follows layout cues that don’t always match Fox’s expected sequence) and color-guided prompts, where performance drops because those cues weren’t a focus in training.
OmniDocBench v1.5: out-of-domain stress test
OmniDocBench is broad and visually diverse, and it helped validate that GutenOCR’s detection skills generalize. We saw a major boost in text detection coverage, but also clear weaknesses on formula-heavy pages and busy, multi-colored backgrounds – areas we’re targeting next with more specialized data and training signals.

Why This Matters for Real Document Systems
Grounding changes the economics of review and iteration:
- Human verification becomes faster: Hallucinated text shows up as boxes over empty regions; missing text becomes visible gaps in coverage (see the overlay sketch below).
- Targeted re-reading becomes reliable: Downstream systems can “zoom in” and reread just the area that matters.
- Provenance becomes first-class: Extracted values can be traced to specific spans on the page.
In practice, this is the difference between “an OCR transcript you hope is right” and a verifiable, evidence-bearing front-end that supports active learning and exception workflows.
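As a concrete example of that review loop, here is a minimal sketch that overlays predicted line boxes on the page image, so a reviewer can spot both failure modes at a glance. The detection output schema ({'text', 'bbox'} in pixel coordinates) is an assumption for the sketch.

```python
# Minimal sketch of grounded review: draw predicted line boxes over the page.
# Hallucinations appear as boxes over empty pixels; misses appear as text
# regions with no box. The detection schema here is an illustrative assumption.
from PIL import Image, ImageDraw

def render_review_overlay(page_path: str, detections: list[dict], out_path: str) -> None:
    """detections: [{'text': str, 'bbox': [x0, y0, x1, y1]}, ...] in pixel coords."""
    page = Image.open(page_path).convert("RGB")
    draw = ImageDraw.Draw(page)
    for det in detections:
        x0, y0, x1, y1 = det["bbox"]
        draw.rectangle([x0, y0, x1, y1], outline=(255, 0, 0), width=2)
        # Label each box with a truncated transcript for quick visual matching.
        draw.text((x0, max(0, y0 - 12)), det["text"][:40], fill=(255, 0, 0))
    page.save(out_path)
```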
Toward Verifiable, Controllable Document Understanding
GutenOCR is our step toward an OCR layer that’s modern in capability but classical in contract: predictable primitives, stable interfaces, and explicit grounding. The results show that a general-purpose vision–language model can be trained into a strong grounded OCR front-end – dramatically improving detection and localized reading – while also surfacing the trade-offs that matter (reading-order assumptions, color cues, formula-heavy layouts).
Next, we’re focused on extending the recipe to be more math-aware, more color-robust, and more structure-aware (especially for tables), without sacrificing the grounding and controllability that make GutenOCR useful in production.
If your document workflows depend on trustworthy text, targeted rereads, and easy human verification, grounded OCR isn’t a “nice to have”; it’s the foundation.