When Iterative RAG Beats Ideal Evidence: A Diagnostic Study in Scientific Multi-hop Question Answering

Retrieval-Augmented Generation (RAG) extends large language models (LLMs) beyond parametric knowledge, yet it remains unclear when iterative retrieval-reasoning loops meaningfully outperform static RAG, particularly in scientific domains characterized by multi-hop reasoning, sparse domain knowledge, and heterogeneous evidence. We provide the first controlled, mechanism-level diagnostic study of whether synchronized iterative retrieval and reasoning can surpass an idealized static upper bound: RAG with Gold Context. We benchmark eleven state-of-the-art LLMs under three regimes: (i) No Context, which measures reliance on parametric memory; (ii) Gold Context, in which all oracle evidence is supplied at once; and (iii) Iterative RAG, a training-free controller that alternates retrieval, hypothesis refinement, and evidence-aware stopping. Using the chemistry-focused ChemKGMultiHopQA dataset, we isolate questions that genuinely require retrieval and analyze model behavior with diagnostics covering retrieval coverage gaps, anchor-carry drop, query quality, composition fidelity, and control calibration. Across models, Iterative RAG consistently outperforms Gold Context, with gains of up to 25.6 percentage points, especially for non-reasoning fine-tuned models. Staged retrieval reduces late-hop failures, mitigates context overload, and enables dynamic correction of early hypothesis drift. The remaining failure modes include incomplete hop coverage, distractor latch trajectories, miscalibrated early stopping, and high composition failure rates even with perfect retrieval. Overall, how evidence is staged often matters more than the mere presence of ideal evidence. We provide practical guidance for deploying and diagnosing RAG systems in specialized scientific settings, and a foundation for more reliable, controllable iterative retrieval-reasoning frameworks.
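To make the third regime concrete, the following is a minimal sketch of a training-free loop that alternates retrieval, hypothesis refinement, and evidence-aware stopping. It is an illustration under stated assumptions, not the controller evaluated in this study: the names `iterative_rag`, `State`, `propose_query`, `refine`, and `is_sufficient` are hypothetical, the three-sentence corpus is invented, and the hand-written stand-in functions take the place of LLM calls and a real retriever.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class State:
    """Working memory carried across retrieval-reasoning steps."""
    question: str
    evidence: List[str] = field(default_factory=list)
    hypothesis: Optional[str] = None
    steps: int = 0


def iterative_rag(
    question: str,
    retrieve: Callable[[str], List[str]],
    propose_query: Callable[[State], str],
    refine: Callable[[State], Optional[str]],
    is_sufficient: Callable[[State], bool],
    max_steps: int = 4,
) -> State:
    """Alternate retrieval and hypothesis refinement until evidence suffices."""
    state = State(question)
    while state.steps < max_steps:
        query = propose_query(state)           # query conditioned on current hypothesis
        for passage in retrieve(query):        # staged retrieval: one hop per step
            if passage not in state.evidence:
                state.evidence.append(passage)
        state.hypothesis = refine(state)       # update the working hypothesis
        state.steps += 1
        if is_sufficient(state):               # evidence-aware stopping
            break
    return state


# --- Toy stand-ins (assumptions for illustration only) ---------------------
TOY_CORPUS = [
    "alpha is synthesized from beta",
    "beta is stored in gamma",
    "delta is unrelated to alpha",
]


def toy_retrieve(query: str, k: int = 1) -> List[str]:
    """Rank passages by token overlap with the query and return the top k."""
    q = set(query.lower().split())
    scored = [(len(q & set(p.lower().split())), p) for p in TOY_CORPUS]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
    return [p for _, p in ranked[:k]]


def toy_propose_query(state: State) -> str:
    """Hop 1 asks for the precursor; hop 2 carries the anchor entity forward."""
    if state.hypothesis is None:
        return "alpha synthesized from"
    return f"{state.hypothesis} stored in"


def toy_refine(state: State) -> Optional[str]:
    """Take the last token of the newest passage as the current hypothesis."""
    return state.evidence[-1].split()[-1] if state.evidence else None


def toy_is_sufficient(state: State) -> bool:
    """Stop once a passage covering the final relation has been retrieved."""
    return any("stored in" in p for p in state.evidence)


def run_toy_example() -> State:
    return iterative_rag(
        "Where is the precursor of alpha stored?",
        toy_retrieve, toy_propose_query, toy_refine, toy_is_sufficient,
    )
```

In this toy run the first step retrieves the precursor fact and the second step reuses that entity as the anchor for the next query, which is the staged behavior the abstract contrasts with supplying all evidence at once. A real system would replace each stand-in with a model call and would need its own stopping criterion.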
