LIVE WEBINAR
Seeing Is Believing: Training an Open-Source Grounded OCR VLM
February 10, 2026 | 2pm ET | 1 hour
Traditional OCR engines do output locations, but they're often brittle across domains, layouts, and languages, and they need fragile post-processing to stay accurate. Vision-language models (VLMs) promise far better flexibility and transfer, yet many current OCR VLMs still falter on provenance: they can "read," but can't consistently show where a value came from without generating long, costly page-wide outputs. In this session, we'll show how a compact, open VLM can deliver reliable line- and word-level grounding, answering "what does it say here?" and "where is X?" with precise boxes and reproducible behavior.
We'll walk through the end-to-end recipe behind GutenOCR (a fine-tune of Qwen2.5-VL): data and synthetic grounding signals, the prompting/system-prompt design that enforces strict output formats, the training stack and hardware profile, and how we evaluate reading, detection, and grounding (not just text accuracy). Expect candid lessons on multi-column layouts and complex tables, plus open code (including our vLLM eval harness) so you can reproduce results or adapt the approach.
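To make "strict output formats" concrete, here is a minimal sketch of the kind of line-level text+bbox request you could send to a Qwen2.5-VL-style checkpoint served behind vLLM's OpenAI-compatible API. The server URL, model name, system-prompt wording, and JSON schema are illustrative assumptions, not GutenOCR's actual prompt or output format.

```python
# Minimal sketch (assumptions only): query a vLLM OpenAI-compatible server
# for strict line-level text+bbox output from a Qwen2.5-VL-style model.
import base64
from openai import OpenAI

# Hypothetical local endpoint and model name; substitute your own deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "GutenOCR"  # placeholder for whatever checkpoint you serve

SYSTEM_PROMPT = (
    "You are an OCR assistant. For every text line you read, emit exactly one "
    "JSON object per output line: {\"text\": <string>, \"bbox\": [x1, y1, x2, y2]} "
    "with pixel coordinates. Output nothing else."
)

def read_page(image_path: str) -> str:
    """Send one page image and return the model's strict JSON-lines output."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model=MODEL,
        temperature=0.0,  # deterministic decoding for reproducible grounding
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text", "text": "Read this page."},
            ]},
        ],
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    print(read_page("sample_page.png"))
```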
Register to learn how to:
- Decide when a grounded VLM beats classic OCR or heavy general models – covering brittleness vs. flexibility, cost/latency, and on-prem constraints
- Build the data recipe for grounding and design prompts/system prompts that yield strict text+bbox outputs
- Fine-tune a compact VLM for reading, localized reading, detection, and conditional "find-this-string" grounding – without brittle block segmentation
- Evaluate what matters in production: CER/WER, bbox P/R/F1, IoU, and hit@k for search on public and in-domain sets (see the metric sketch after this list)
- Handle tricky documents – multi-column pages, merged/wrapped table cells – with region-scoped reads, label anchors, and lightweight post-processing
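As a concrete reference for the detection metrics above, here is a small, self-contained sketch of IoU-based greedy box matching rolled up into precision/recall/F1. It illustrates the metric only; the threshold and matching policy are assumptions, and this is not the vLLM eval harness shown in the session.

```python
# Sketch of bbox detection metrics: greedy IoU matching of predicted boxes
# to ground truth, then precision/recall/F1 at a fixed IoU threshold.

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def bbox_prf1(pred, gold, thresh=0.5):
    """Greedy one-to-one matching; returns (precision, recall, f1)."""
    unmatched = list(range(len(gold)))
    tp = 0
    for p in pred:
        best_j, best_iou = None, 0.0
        for j in unmatched:
            score = iou(p, gold[j])
            if score > best_iou:
                best_j, best_iou = j, score
        if best_j is not None and best_iou >= thresh:
            tp += 1
            unmatched.remove(best_j)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = (2 * precision * recall / denom) if denom else 0.0
    return precision, recall, f1

if __name__ == "__main__":
    pred = [[10, 10, 110, 40], [200, 50, 320, 90]]
    gold = [[12, 12, 108, 38], [400, 400, 500, 450]]
    print(bbox_prf1(pred, gold))  # one true positive out of two boxes each
```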
Speakers
