Retrieval-Infused Reasoning Sandbox: A Benchmark for Decoupling Retrieval and Reasoning Capabilities

Shuangshuang Ying, Zheyu Wang, Yunjian Peng, Jin Chen, Yuhao Wu, Hongbin Lin, Dingyu He, Siyi Liu, Gengchen Yu, YinZhu Piao, Yuchen Wu, Xin Gui, Zhongyuan Peng, Xin Li, Xeron Du, Libo Qin, YiXin Cao, Ge Zhang

Despite strong performance on existing benchmarks, it remains unclear whether large language models can reason over genuinely novel scientific information. Most evaluations score end-to-end RAG pipelines, where reasoning is confounded with retrieval and toolchain choices, and the signal is further contaminated by parametric memorization and open-web volatility. We introduce DeR2, a controlled deep-research sandbox that isolates document-grounded reasoning while preserving the core difficulties of deep search: multi-step synthesis, denoising, and evidence-based conclusion-making. DeR2 decouples evidence access from reasoning via four regimes: Instruction-only, Concepts (gold concepts without documents), Related-only (only relevant documents), and Full-set (relevant documents plus topically related distractors). The resulting regime gaps are interpretable: they operationalize retrieval loss versus reasoning loss and enable fine-grained error attribution. To prevent parametric leakage, we apply a two-phase validation that requires parametric failure without evidence while ensuring oracle-concept solvability. To ensure reproducibility, each instance provides a frozen document library (drawn from 2023-2025 theoretical papers) with expert-annotated concepts and validated rationales. Experiments across a diverse set of state-of-the-art foundation models reveal substantial variation and significant headroom: some models exhibit mode-switch fragility, performing worse with the Full-set than with Instruction-only, while others show structural concept misuse, correctly naming concepts but failing to execute them as procedures.
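As a rough illustration of how the regime gaps and the two-phase validation described in the abstract could be computed, the Python sketch below may help. It is not the authors' implementation. The names `RegimeAccuracy`, `regime_gaps`, and `passes_two_phase_validation` are hypothetical, and so are the gap names `distractor_loss`, `extraction_loss`, and `evidence_gain`. The specific differences used to define the gaps are assumptions, since the abstract does not state the exact formulas. Only the mode-switch fragility check (Full-set scoring below Instruction-only) and the two validation conditions follow the abstract's wording directly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegimeAccuracy:
    """Accuracy of one model under each of the four evidence regimes (0.0 to 1.0)."""
    instruction_only: float
    concepts: float
    related_only: float
    full_set: float


def regime_gaps(acc: RegimeAccuracy) -> dict:
    """Compute illustrative regime gaps (assumed definitions, not the paper's)."""
    return {
        # Assumption: accuracy lost when distractors are added to the relevant
        # documents, used here as a proxy for retrieval/denoising loss.
        "distractor_loss": acc.related_only - acc.full_set,
        # Assumption: accuracy lost when the model must extract concepts from
        # documents instead of receiving gold concepts, a reasoning-side proxy.
        "extraction_loss": acc.concepts - acc.related_only,
        # Assumption: accuracy gained from relevant documents over no evidence.
        "evidence_gain": acc.related_only - acc.instruction_only,
        # From the abstract: a model is mode-switch fragile when it performs
        # worse with the Full-set than with Instruction-only.
        "mode_switch_fragile": acc.full_set < acc.instruction_only,
    }


def passes_two_phase_validation(correct_without_evidence: bool,
                                correct_with_gold_concepts: bool) -> bool:
    """Keep an instance only if it fails parametrically but is solvable with oracle concepts."""
    return (not correct_without_evidence) and correct_with_gold_concepts
```

Under these assumed definitions, a positive `distractor_loss` would point toward denoising as the bottleneck, and a positive `extraction_loss` toward difficulty in turning documents into usable concepts.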
