In this paper, we propose Agentar-Fin-OCR, a document parsing system tailored to financial-domain documents that transforms ultra-long financial PDFs into semantically consistent, highly accurate, structured outputs with audit-grade provenance. To address finance-specific challenges such as complex layouts, cross-page structural discontinuities, and the need for cell-level referencing, Agentar-Fin-OCR combines (1) a Cross-page Contents Consolidation algorithm to restore continuity across pages and a Document-level Heading Hierarchy Reconstruction (DHR) module to build a globally consistent Table of Contents (TOC) tree for structure-aware retrieval, and (2) a difficulty-adaptive curriculum learning strategy for table parsing, together with a CellBBoxRegressor module that uses structural anchor tokens to localize table cells from decoder hidden states without external detectors. Experiments show that our model performs strongly on the table parsing metrics of OmniDocBench. To enable realistic evaluation in the financial vertical, we further introduce FinDocBench, a benchmark that covers six financial document categories with expert-verified annotations and evaluation metrics including Table of Contents edit-distance-based similarity (TocEDS), cross-page concatenated TEDS, and Table Cell Intersection over Union (C-IoU). We evaluate a wide range of state-of-the-art models on FinDocBench to assess their capabilities and remaining limitations on financial documents. Overall, Agentar-Fin-OCR and FinDocBench provide a practical foundation for reliable downstream financial document applications.
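To make the two simpler metric families named above concrete, the following is a minimal sketch of their generic building blocks: intersection over union between two axis-aligned cell boxes, and an edit-distance-based similarity between two heading sequences. It is not the FinDocBench implementation. The abstract does not give the exact definitions of C-IoU or TocEDS, so the function names (`box_iou`, `edit_distance`, `edit_similarity`), the `(x_min, y_min, x_max, y_max)` box format, the flattening of a TOC tree into a list of heading strings, and the normalization by the longer sequence are all assumptions made for illustration. How predicted cells are matched to reference cells and how scores are aggregated over a document are left out.

```python
from typing import Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in page coordinates.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def edit_similarity(pred: Sequence[str], ref: Sequence[str]) -> float:
    """1 - normalized edit distance; the normalization is an assumption."""
    longest = max(len(pred), len(ref))
    if longest == 0:
        return 1.0  # two empty sequences are treated as identical
    return 1.0 - edit_distance(pred, ref) / longest


if __name__ == "__main__":
    # Hypothetical predicted vs. reference cell box.
    print(box_iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)))
    # Hypothetical flattened TOC headings, one mismatch out of two.
    print(edit_similarity(["1 Overview", "1.1 Scope"],
                          ["1 Overview", "1.2 Scope"]))
```

Comparing flattened heading lists ignores nesting depth. A metric that scores the TOC as a tree would need a tree-aware distance, which this sketch does not attempt.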