With the rapid progress of tool-using and agentic large language models (LLMs), Retrieval-Augmented Generation (RAG) is evolving from one-shot, passive retrieval into multi-turn, decision-driven evidence acquisition. Despite strong results in open-domain settings, existing agentic search frameworks commonly treat long documents as flat collections of chunks, underutilizing document-native priors such as hierarchical organization and sequential discourse structure. We introduce DeepRead, a structure-aware, multi-turn document reasoning agent that explicitly operationalizes these priors for long-document question answering. DeepRead uses an LLM-based OCR model to convert PDFs into structured Markdown that preserves headings and paragraph boundaries. It then indexes each document at the paragraph level and assigns every paragraph a coordinate-style metadata key that encodes its section identity and its order within that section. Building on this representation, DeepRead equips the LLM with two complementary tools: a Retrieve tool that localizes relevant paragraphs and exposes their structural coordinates (with lightweight scanning context), and a ReadSection tool that enables contiguous, order-preserving reading within a specified section and paragraph range. Our experiments show that DeepRead achieves significant improvements over Search-o1-style agentic search in document question answering, and they also validate the synergy between the retrieval and reading tools. A fine-grained behavioral analysis reveals a reading and reasoning paradigm that resembles the human "locate then read" behavior.
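The coordinate-keyed index and the two tools described above can be illustrated with a minimal sketch. Everything named here is a hypothetical stand-in rather than DeepRead's actual implementation: the `Paragraph` record, the `index_markdown`, `retrieve`, and `read_section` functions, and their parameters are assumptions made for illustration. In particular, the ranking step uses plain token overlap as a placeholder for whatever retriever the system really uses, the lightweight scanning context is omitted, and the Markdown is assumed to separate headings and paragraphs with blank lines.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Paragraph:
    section_id: int  # position of the enclosing heading in document order (0 = before any heading)
    para_idx: int    # 0-based position of the paragraph inside its section
    heading: str
    text: str


def index_markdown(markdown: str) -> list[Paragraph]:
    """Split Markdown into paragraphs tagged with (section_id, para_idx).

    Assumes headings and paragraphs are separated by blank lines.
    """
    paragraphs: list[Paragraph] = []
    section_id, para_idx, heading = 0, 0, "(preamble)"
    for block in re.split(r"\n\s*\n", markdown.strip()):
        block = block.strip()
        if not block:
            continue
        match = re.match(r"#{1,6}\s+(.*)", block)
        if match:
            # A heading opens a new section and resets the in-section counter.
            section_id += 1
            para_idx = 0
            heading = match.group(1).strip()
            continue
        paragraphs.append(Paragraph(section_id, para_idx, heading, block))
        para_idx += 1
    return paragraphs


def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(index: list[Paragraph], query: str, top_k: int = 2) -> list[dict]:
    """Return the top_k paragraphs by token overlap, with their coordinates.

    Token overlap is a placeholder scoring function, not the paper's retriever.
    """
    query_tokens = _tokens(query)
    scored = [(len(query_tokens & _tokens(p.text)), p) for p in index]
    scored = [(score, p) for score, p in scored if score > 0]
    scored.sort(key=lambda pair: -pair[0])  # stable sort: ties keep document order
    return [
        {"section_id": p.section_id, "para_idx": p.para_idx,
         "heading": p.heading, "text": p.text}
        for _, p in scored[:top_k]
    ]


def read_section(index: list[Paragraph], section_id: int, start: int, end: int) -> list[str]:
    """Return paragraphs start..end (inclusive) of one section, in original order."""
    return [
        p.text for p in index
        if p.section_id == section_id and start <= p.para_idx <= end
    ]


# Usage: locate a paragraph first, then read its surrounding section contiguously.
doc = """# Introduction

Long documents have structure.

# Method

We index paragraphs by section.

Each paragraph gets a coordinate key.

The agent reads contiguous ranges.
"""
index = index_markdown(doc)
hit = retrieve(index, "coordinate key", top_k=1)[0]
context = read_section(index, hit["section_id"], 0, hit["para_idx"] + 1)
```

Because each hit carries its section and in-section position, a follow-up call can read the neighbouring paragraphs in order instead of issuing more retrieval queries, which is the "locate then read" pattern in miniature.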