Retrieval-Augmented Generation (RAG) extends large language models (LLMs) beyond parametric knowledge, yet it is unclear when iterative retrieval-reasoning loops meaningfully outperform static RAG, particularly in scientific domains with multi-hop reasoning, sparse domain knowledge, and heterogeneous evidence. We provide the first controlled, mechanism-level diagnostic study of whether synchronized iterative retrieval and reasoning can surpass an idealized static upper bound, Gold Context RAG. We benchmark eleven state-of-the-art LLMs under three regimes: (i) No Context, measuring reliance on parametric memory; (ii) Gold Context, where all oracle evidence is supplied at once; and (iii) Iterative RAG, a training-free controller that alternates retrieval, hypothesis refinement, and evidence-aware stopping. Using the chemistry-focused ChemKGMultiHopQA dataset, we isolate questions requiring genuine retrieval and analyze behavior with diagnostics spanning retrieval coverage gaps, anchor-carry drop, query quality, composition fidelity, and control calibration. Across models, Iterative RAG consistently outperforms Gold Context, with gains of up to 25.6 percentage points, especially for models without reasoning-oriented fine-tuning. Staged retrieval reduces late-hop failures, mitigates context overload, and enables dynamic correction of early hypothesis drift; remaining failure modes include incomplete hop coverage, distractor-latching trajectories, miscalibrated early stopping, and high composition failure rates even under perfect retrieval. Overall, staged retrieval is often more influential than the mere presence of ideal evidence. We provide practical guidance for deploying and diagnosing RAG systems in specialized scientific settings, and a foundation for more reliable, controllable iterative retrieval-reasoning frameworks.