With the rapid advancement of agent-based methods in recent years, Agentic RAG has become an important research direction. Multi-hop reasoning, which requires models to engage in deliberate thinking and multi-step interaction, serves as a critical testbed for assessing such capabilities. However, existing benchmarks typically provide only final questions and answers, lacking the intermediate hop-level questions that gradually connect atomic questions to the final multi-hop query. This limitation prevents researchers from identifying the step at which an agent fails and precludes fine-grained evaluation of model capabilities. Moreover, most current benchmarks are manually constructed, a process that is time-consuming and labor-intensive and that limits scalability and generalization. To address these challenges, we introduce AgenticRAGTracer, the first Agentic RAG benchmark that is constructed primarily automatically by large language models and designed to support step-by-step validation. Our benchmark spans multiple domains, contains 1,305 data points, and has no overlap with existing mainstream benchmarks. Extensive experiments demonstrate that even the best large language models perform poorly on our dataset. For instance, GPT-5 attains merely 22.6\% EM accuracy on the hardest portion of our dataset. Hop-aware diagnosis reveals that failures are primarily driven by distorted reasoning chains that either collapse prematurely or wander into over-extension. This highlights a critical inability to allocate steps consistent with the task's logical structure, providing a diagnostic dimension missing from traditional evaluations. We believe our work will facilitate research in Agentic RAG and inspire further meaningful progress in this area. Our code and data are available at https://github.com/YqjMartin/AgenticRAGTracer.