The expansion of retrieval-augmented generation (RAG) into multimodal domains has intensified the challenge of processing complex visual documents, such as financial reports. While page-level chunking and retrieval are a natural starting point, they create a critical bottleneck: delivering entire pages to the generator introduces excessive extraneous context. This not only overloads the generator's attention mechanism but also dilutes the most salient evidence. Moreover, compressing these information-rich pages into a limited visual token budget further increases the risk of hallucinations. To address this, we introduce AgenticOCR, a dynamic parsing paradigm that transforms optical character recognition (OCR) from a static, full-text process into a query-driven, on-demand extraction system. By autonomously analyzing document layout in a "thinking with images" manner, AgenticOCR identifies and selectively recognizes regions of interest. This approach performs on-demand decompression of visual tokens precisely where needed, effectively decoupling retrieval granularity from rigid page-level chunking. AgenticOCR has the potential to serve as the "third building block" of the visual document RAG stack, operating alongside and enhancing standard Embedding and Reranking modules. Experimental results demonstrate that AgenticOCR improves both the efficiency and accuracy of visual RAG systems, achieving expert-level performance in long document understanding. Code and models are available at https://github.com/OpenDataLab/AgenticOCR.
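The core idea above — recognizing only the page regions relevant to a query, instead of transcribing the whole page — can be sketched as follows. This is a minimal illustrative mock, not the paper's actual implementation: all names (`Region`, `locate_regions`, `agentic_ocr`, the `budget` parameter) are hypothetical, layout analysis is stubbed out, and visual relevance is approximated by keyword overlap.

```python
# Hypothetical sketch of query-driven, on-demand OCR. A page is modeled as a
# list of layout regions; the "agent" scores each region against the query and
# recognizes only the top-scoring crops, rather than running full-page OCR.

from dataclasses import dataclass


@dataclass
class Region:
    label: str          # e.g. "title", "table", "paragraph"
    keywords: set       # stand-in for visual/layout features of the crop
    text: str           # what OCR of this crop would return


def locate_regions(page):
    """Stage 1: layout analysis ("thinking with images" over the page).
    In a real system a vision model would propose these regions."""
    return page


def relevance(region, query_terms):
    """Score a region against the query (here: simple keyword overlap)."""
    return len(region.keywords & query_terms)


def agentic_ocr(page, query, budget=2):
    """Recognize only the `budget` most query-relevant regions —
    on-demand 'decompression' of visual tokens where they matter."""
    query_terms = set(query.lower().split())
    regions = locate_regions(page)
    ranked = sorted(regions, key=lambda r: relevance(r, query_terms), reverse=True)
    selected = [r for r in ranked[:budget] if relevance(r, query_terms) > 0]
    return [r.text for r in selected]


page = [
    Region("title", {"annual", "report"}, "2023 Annual Report"),
    Region("table", {"revenue", "q4"}, "Q4 revenue: $1.2B"),
    Region("paragraph", {"outlook", "guidance"}, "Guidance unchanged."),
]

print(agentic_ocr(page, "q4 revenue"))  # → ['Q4 revenue: $1.2B']
```

Only the table region is transcribed for this query; the title and outlook paragraph never consume the generator's context budget, which is the decoupling of retrieval granularity from page-level chunking that the abstract describes.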