LIVE WEBINAR

Seeing Is Believing: Training an Open-Source Grounded OCR VLM

February 10, 2026  |  2pm ET  |  1 hour

 

Traditional OCR engines do output locations – but they're often brittle across domains, layouts, and languages, requiring fragile post-processing to stay accurate. Vision-language models (VLMs) promise far better flexibility and transfer, yet many current OCR VLMs still falter on provenance: they can "read," but can't consistently show where a value came from without generating long, costly page-wide outputs. In this session, we'll show how a compact, open VLM can deliver reliable, line/word-level grounding – answering "what does it say here?" and "where is X?" with precise boxes and reproducible behavior.  

We'll walk through the end-to-end recipe behind GutenOCR (a fine-tune of Qwen2.5-VL): data and synthetic grounding signals, the prompting/system-prompt design that enforces strict output formats, the training stack and hardware profile, and how we evaluate reading, detection, and grounding (not just text accuracy). Expect candid lessons on multi-column layouts and complex tables, plus open code (including our vLLM eval harness) so you can reproduce results or adapt the approach. 
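To make the "strict output formats" idea concrete, here is a minimal sketch of what a system prompt plus a validating parser could look like. This is an illustrative assumption, not GutenOCR's actual prompt or schema: the `SYSTEM_PROMPT` text, the `{"text": ..., "bbox": [x1, y1, x2, y2]}` schema, and the `parse_grounded_output` helper are all hypothetical names chosen for this example.

```python
import json

# Hypothetical system prompt in the spirit of a strict text+bbox format;
# the prompt actually used in the GutenOCR recipe may differ.
SYSTEM_PROMPT = (
    "You are an OCR assistant. Reply ONLY with a JSON list of objects, "
    'each of the form {"text": "<line text>", "bbox": [x1, y1, x2, y2]}, '
    "where bbox holds pixel coordinates of the line. No extra prose."
)

def parse_grounded_output(raw: str):
    """Validate a model reply against the strict text+bbox schema.

    Returns the parsed list, or raises ValueError so callers can retry
    or flag the page instead of silently accepting malformed output.
    """
    try:
        items = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"not valid JSON: {e}") from e
    if not isinstance(items, list):
        raise ValueError("expected a JSON list")
    for it in items:
        if not isinstance(it, dict) or set(it) != {"text", "bbox"}:
            raise ValueError(f"unexpected item shape: {it!r}")
        x1, y1, x2, y2 = it["bbox"]  # must be exactly four numbers
        if not (x1 <= x2 and y1 <= y2):
            raise ValueError(f"degenerate box: {it['bbox']}")
    return items

# A well-formed reply parses cleanly; anything else is rejected up front.
reply = '[{"text": "Total: $42.00", "bbox": [120, 880, 310, 904]}]'
lines = parse_grounded_output(reply)
```

Enforcing the schema at parse time is what makes the behavior reproducible: a malformed page fails loudly rather than leaking free-form text into downstream post-processing.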

 

Register to learn how to:

  • Decide when a grounded VLM beats classic OCR or heavy general models – covering brittleness vs. flexibility, cost/latency, and on-prem constraints
  • Build the data recipe for grounding and design prompts/system prompts that yield strict text+bbox outputs
  • Fine-tune a compact VLM for reading, localized reading, detection, and conditional "find-this-string" grounding – without brittle block segmentation
  • Evaluate what matters in production: CER/WER, bbox P/R/F1, IoU, and hit@k for search on public and in-domain sets
  • Handle tricky documents – multi-column pages, merged/wrapped table cells – with region-scoped reads, label anchors, and lightweight post-processing  
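As a taste of the evaluation bullet above, here is a small sketch of box-level IoU and a greedy one-to-one matching that yields precision/recall/F1 at an IoU threshold. The function names and the greedy matching strategy are assumptions for illustration; the webinar's vLLM eval harness may match boxes differently (e.g. Hungarian assignment).

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def bbox_prf1(preds, gts, thresh=0.5):
    """Greedily match each predicted box to an unused ground-truth box
    at IoU >= thresh, then report precision, recall, and F1."""
    matched = set()
    tp = 0
    for p in preds:
        best, best_iou = None, thresh
        for i, g in enumerate(gts):
            if i in matched:
                continue
            v = iou(p, g)
            if v >= best_iou:
                best, best_iou = i, v
        if best is not None:
            matched.add(best)
            tp += 1
    prec = tp / len(preds) if preds else 0.0
    rec = tp / len(gts) if gts else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```

Text metrics (CER/WER) and hit@k for search layer on top of this: a grounded answer only counts if both the string and its box check out.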

Speakers

Hunter Heidenreich
AI Research Scientist

Hunter is a Research Scientist at Roots, focused on production-grade document AI for the insurance industry. He leads work on page segmentation, interpretable model calibration, and reasoning-based models, and holds a BS from Drexel and an MS from Harvard.

Ben Elliott
AI Engineer

Ben researches post-training at Roots and graduated from Harvard in 2023. Outside of machine learning, he likes to write poems.

Yosheb Getachew
AI Engineer

Yosh completed his undergraduate and master’s studies at Stanford. He currently researches post-training at Roots. In his free time, he enjoys designing and creating clothes, volunteering with animals, and watching anime.


Register Now