Search-integrated reasoning enables language agents to transcend static parametric knowledge by actively querying external sources. However, training these agents via reinforcement learning is hindered by the multi-scale credit assignment problem: existing methods typically rely on sparse, trajectory-level rewards that cannot distinguish high-quality reasoning from fortuitous guesses, leading to redundant or misleading search behaviors. To address this, we propose Search-R2, a novel Actor-Refiner collaboration framework that enhances reasoning through targeted intervention, with both components jointly optimized during training. Our approach decomposes generation into an Actor, which produces initial reasoning trajectories, and a Meta-Refiner, which selectively diagnoses and repairs flawed steps via a "cut-and-regenerate" mechanism. To provide fine-grained supervision, we introduce a hybrid reward design that couples outcome correctness with a dense process reward quantifying the information density of retrieved evidence. Theoretically, we formalize the Actor-Refiner interaction as a smoothed mixture policy and prove that selective correction yields a strict performance improvement over the uncorrected Actor policy. Extensive experiments on general and multi-hop QA datasets demonstrate that Search-R2 consistently outperforms strong RAG and RL-based baselines across model scales, achieving superior reasoning accuracy with minimal overhead.
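To make the mechanism concrete, below is a minimal, self-contained sketch of how the "cut-and-regenerate" repair and the hybrid reward described above could be wired together. All names here (Step, info_density, cut_and_regenerate, hybrid_reward, and the mixing weight alpha) are illustrative assumptions, not the paper's actual interface.

```python
# Illustrative sketch of the Actor/Meta-Refiner loop; names and the
# info-density proxy are assumptions, not Search-R2's actual API.
from dataclasses import dataclass

@dataclass
class Step:
    text: str            # reasoning text for this step
    evidence: list       # documents retrieved at this step
    flawed: bool         # Refiner's diagnosis (assumed given here)

def info_density(evidence: list) -> float:
    """Toy proxy for the dense process reward: fraction of distinct
    retrieved documents. A real system would score the relevance of
    the evidence to the question instead."""
    if not evidence:
        return 0.0
    return len(set(evidence)) / len(evidence)

def cut_and_regenerate(trajectory: list[Step], regenerate) -> list[Step]:
    """Cut the trajectory at the first step the Refiner flags as flawed
    and regenerate the suffix; the untouched prefix is kept verbatim."""
    for i, step in enumerate(trajectory):
        if step.flawed:
            return trajectory[:i] + regenerate(trajectory[:i])
    return trajectory  # no intervention needed

def hybrid_reward(trajectory: list[Step], answer_correct: bool,
                  alpha: float = 0.5) -> float:
    """Couple sparse outcome correctness with a dense process reward
    averaged over retrieval steps (alpha is an assumed mixing weight)."""
    outcome = 1.0 if answer_correct else 0.0
    process = (sum(info_density(s.evidence) for s in trajectory)
               / max(len(trajectory), 1))
    return outcome + alpha * process

# Example: a 2-step trajectory whose second step was flagged as flawed.
traj = [Step("decompose question", ["doc_a", "doc_b"], flawed=False),
        Step("query wrong entity", ["doc_a", "doc_a"], flawed=True)]
repaired = cut_and_regenerate(
    traj, regenerate=lambda prefix: [Step("query right entity", ["doc_c"], False)])
print(hybrid_reward(repaired, answer_correct=True))  # 1.0 + 0.5 * mean density = 1.5
```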
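For the theoretical framing, one standard way to write a smoothed mixture of an Actor policy and a Refiner policy is sketched below; the notation (π_A, π_R, and intervention rate β) is assumed for illustration and may differ from the paper's.

```latex
% Assumed notation: \pi_A = Actor policy, \pi_R = Refiner's corrected policy,
% \beta(s) \in [0,1] = probability the Refiner intervenes at state s.
\[
  \pi_{\mathrm{mix}}(a \mid s) \;=\; \bigl(1-\beta(s)\bigr)\,\pi_A(a \mid s)
  \;+\; \beta(s)\,\pi_R(a \mid s)
\]
% Selective correction corresponds to \beta(s) > 0 only on diagnosed flawed
% steps; the mixture strictly improves on \pi_A whenever the Refiner's
% repairs have higher expected return than the Actor's originals there.
```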