GutenOCR is a family of grounded OCR front-ends obtained by fine-tuning Qwen2.5-VL-3B and Qwen2.5-VL-7B. The resulting single-checkpoint vision-language models expose reading, detection, and grounding through a unified, prompt-based interface. Trained on business documents, scientific articles, and synthetic grounding data, the models support full-page and localized reading with line- and paragraph-level bounding boxes and conditional ``where is x?'' queries. We introduce a grounded OCR evaluation protocol and show that GutenOCR-7B more than doubles the composite grounded OCR score of its Qwen2.5-VL-7B backbone on 10.5K held-out business and scientific pages (0.40 to 0.82). On Fox and OmniDocBench v1.5, our approach substantially improves region- and line-level OCR as well as text-detection recall, but reveals trade-offs in page-level linearization, color-guided OCR, and formula-heavy layouts.