Automated parsing of scanned documents into richly structured, machine-readable formats remains a critical bottleneck in Document AI, as traditional multi-stage pipelines suffer from error propagation and limited adaptability to diverse layouts. We introduce layoutRL, an end-to-end reinforcement learning framework that trains models to be explicitly layout-aware by optimizing a composite reward combining normalized edit distance, paragraph count accuracy, and reading order preservation. Leveraging our newly released dataset, Infinity-Doc-55K, which combines 55K high-fidelity synthetic scanned documents with expert-filtered real-world documents, we instantiate layoutRL in a vision-language-model-based parser called Infinity-Parser. Evaluated on English and Chinese benchmarks for OCR, table and formula extraction, and reading order detection, Infinity-Parser achieves new state-of-the-art performance in both accuracy and structural fidelity, outperforming specialist pipelines and general-purpose vision-language models. We will publicly release our code and dataset to accelerate progress in robust document understanding.
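The abstract names the three ingredients of the composite reward but not how they are combined. The sketch below is a minimal illustration assuming an equal-weighted average of three sub-rewards; the helper names (`edit_similarity`, `paragraph_count_accuracy`, `reading_order_score`) and their exact formulas are hypothetical stand-ins, not the paper's actual definitions.

```python
# Hypothetical sketch of a composite reward over (edit distance,
# paragraph count, reading order). The weighting scheme and all
# sub-reward formulas are assumptions for illustration only.
from difflib import SequenceMatcher


def edit_similarity(pred: str, ref: str) -> float:
    """Similarity in [0, 1]; stands in for 1 - normalized edit distance."""
    return SequenceMatcher(None, pred, ref).ratio()


def paragraph_count_accuracy(pred_paras: list[str], ref_paras: list[str]) -> float:
    """Penalize deviation in paragraph count, clipped to [0, 1]."""
    if not ref_paras:
        return 1.0 if not pred_paras else 0.0
    return max(0.0, 1.0 - abs(len(pred_paras) - len(ref_paras)) / len(ref_paras))


def reading_order_score(pred_order: list[int], ref_order: list[int]) -> float:
    """Fraction of adjacent block pairs in the reference order that the
    prediction keeps in the same relative order."""
    pos = {block: i for i, block in enumerate(pred_order)}
    pairs = list(zip(ref_order, ref_order[1:]))
    if not pairs:
        return 1.0
    kept = sum(1 for a, b in pairs if a in pos and b in pos and pos[a] < pos[b])
    return kept / len(pairs)


def composite_reward(pred: str, ref: str,
                     pred_paras: list[str], ref_paras: list[str],
                     pred_order: list[int], ref_order: list[int],
                     w_edit: float = 1.0, w_para: float = 1.0,
                     w_order: float = 1.0) -> float:
    """Weighted average of the three sub-rewards (weights are assumed)."""
    total = w_edit + w_para + w_order
    return (w_edit * edit_similarity(pred, ref)
            + w_para * paragraph_count_accuracy(pred_paras, ref_paras)
            + w_order * reading_order_score(pred_order, ref_order)) / total
```

Under these assumptions, each sub-reward lies in [0, 1], so the composite reward is bounded and the three terms can be traded off against each other by adjusting the weights during RL training.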