We present \textbf{LightOnOCR-2-1B}, a 1B-parameter end-to-end multilingual vision--language model that converts document images (e.g., PDFs) into clean, naturally ordered text without brittle OCR pipelines. Trained on a large-scale, high-quality distillation mix with strong coverage of scans, French documents, and scientific PDFs, LightOnOCR-2 achieves state-of-the-art results on OlmOCR-Bench while being 9$\times$ smaller and substantially faster than prior best-performing models. We further extend the output format to predict normalized bounding boxes for embedded images, introducing localization during pretraining via a resume strategy and refining it with RLVR using IoU-based rewards. Finally, we improve robustness with checkpoint averaging and task-arithmetic merging. We release model checkpoints under Apache 2.0, and the dataset and \textbf{LightOnOCR-bbox-bench} evaluation under their respective licenses.
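The abstract mentions checkpoint averaging and task-arithmetic merging for robustness but does not spell out the recipe in this excerpt. A minimal sketch of the two operations over parameter dictionaries (hypothetical helper names; parameters stored as scalars here for illustration, arrays in practice):

```python
def average_checkpoints(checkpoints):
    """Uniform average of a list of parameter dicts (name -> value)."""
    n = len(checkpoints)
    return {k: sum(ckpt[k] for ckpt in checkpoints) / n for k in checkpoints[0]}

def task_arithmetic_merge(base, finetuned, scales):
    """Task-arithmetic merge: theta = base + sum_i scale_i * (theta_i - base),
    i.e. scaled task vectors added back onto the base parameters."""
    return {
        k: base[k] + sum(s * (ckpt[k] - base[k])
                         for ckpt, s in zip(finetuned, scales))
        for k in base
    }
```

Checkpoint averaging smooths noise across nearby training checkpoints, while task arithmetic lets the scale of each fine-tuned variant's contribution be tuned independently.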