When Iterative RAG Beats Ideal Evidence: A Diagnostic Study in Scientific Multi-hop Question Answering

Retrieval-Augmented Generation (RAG) extends large language models (LLMs) beyond their parametric knowledge, yet it remains unclear when iterative retrieval-reasoning loops meaningfully outperform static RAG, particularly in scientific domains that combine multi-hop reasoning, sparse domain knowledge, and heterogeneous evidence. We provide the first controlled, mechanism-level diagnostic study of whether synchronized iterative retrieval and reasoning can surpass an idealized static upper bound, Gold Context RAG. We benchmark eleven state-of-the-art LLMs under three regimes: (i) No Context, which measures reliance on parametric memory; (ii) Gold Context, in which all oracle evidence is supplied at once; and (iii) Iterative RAG, a training-free controller that alternates retrieval, hypothesis refinement, and evidence-aware stopping. Using the chemistry-focused ChemKGMultiHopQA dataset, we isolate the questions that genuinely require retrieval and analyze model behavior with diagnostics covering retrieval coverage gaps, anchor-carry drop, query quality, composition fidelity, and control calibration. Across models, Iterative RAG consistently outperforms Gold Context, with gains of up to 25.6 percentage points, and the gains are largest for models without reasoning-oriented fine-tuning. Staged retrieval reduces late-hop failures, mitigates context overload, and enables dynamic correction of early hypothesis drift. The remaining failure modes include incomplete hop coverage, distractor latch trajectories, early stopping miscalibration, and high composition failure rates even with perfect retrieval. Overall, how evidence is staged often matters more than the mere presence of ideal evidence. We offer practical guidance for deploying and diagnosing RAG systems in specialized scientific settings, and a foundation for more reliable, controllable iterative retrieval-reasoning frameworks.
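To make the control flow of regime (iii) concrete, the following is a minimal sketch of a loop that alternates retrieval, hypothesis refinement, and evidence-aware stopping. It is an illustration under strong assumptions, not the controller evaluated in this study: the names `FACTS`, `retrieve`, `refine`, `Trace`, and `iterative_rag` are hypothetical, the evidence store is invented toy data, and the retriever and the LLM are replaced by deterministic stand-ins. The hop plan is also given up front, whereas a real controller would presumably generate each query from the current hypothesis.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Toy evidence store mapping (subject, relation) -> object.
# Entirely hypothetical data, used only to exercise the loop.
FACTS = {
    ("compound_A", "synthesized_from"): "compound_B",
    ("compound_B", "catalyzed_by"): "catalyst_C",
    ("catalyst_C", "contains_metal"): "palladium",
}


def retrieve(anchor: str, relation: str) -> Optional[str]:
    """Stand-in retriever: one evidence string, or None on a coverage gap."""
    obj = FACTS.get((anchor, relation))
    return None if obj is None else f"{anchor} {relation} {obj}"


def refine(evidence: str) -> str:
    """Stand-in for the LLM's hypothesis update: carry the new anchor forward."""
    return evidence.split()[-1]


@dataclass
class Trace:
    answer: Optional[str] = None
    evidence: List[str] = field(default_factory=list)
    stop_reason: str = ""


def iterative_rag(start: str, hops: List[str], max_steps: int = 5) -> Trace:
    """Alternate retrieval and refinement, one hop at a time."""
    trace = Trace()
    anchor = start
    for step, relation in enumerate(hops):
        if step >= max_steps:
            # Retrieval budget spent before all hops were supported.
            trace.stop_reason = "budget_exhausted"
            return trace
        passage = retrieve(anchor, relation)
        if passage is None:
            # Evidence-aware stop: abstain instead of guessing the missing hop.
            trace.stop_reason = "coverage_gap"
            return trace
        trace.evidence.append(passage)
        anchor = refine(passage)  # the anchor carried into the next hop
    trace.answer = anchor
    trace.stop_reason = "all_hops_covered"
    return trace


if __name__ == "__main__":
    result = iterative_rag(
        "compound_A", ["synthesized_from", "catalyzed_by", "contains_metal"]
    )
    print(result.answer, result.stop_reason, len(result.evidence))
```

In this sketch, each hop's query depends on the anchor produced by the previous hop, which is the property that separates staged retrieval from supplying all evidence at once. The recorded `stop_reason` values loosely mirror two of the diagnostics named above (coverage gaps and stopping calibration). They are simplifications and not the study's metrics.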
