Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by incorporating external knowledge, yet traditional single-round retrieval struggles with complex multi-step reasoning. Agentic RAG addresses this by letting LLMs dynamically decide when and what to retrieve, but current RL-based training methods suffer from sparse outcome rewards that discard intermediate signals and from low sample efficiency, since failed samples contribute no learning signal. We propose Search-P1, a framework that introduces path-centric reward shaping for agentic RAG training, comprising two key components: (1) a Path-Centric Reward that evaluates the structural quality of reasoning trajectories through order-agnostic step coverage and soft scoring, extracting learning signals even from failed samples; and (2) Dual-Track Path Scoring, which uses offline-generated reference planners to assess paths from both self-consistency and reference-alignment perspectives. Experiments on multiple QA benchmarks demonstrate that Search-P1 achieves significant improvements over Search-R1 and other strong baselines, with an average accuracy gain of 7.7 points.
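To make the reward-shaping idea concrete, here is a minimal Python sketch of an order-agnostic, soft path reward under stated assumptions: the names `path_centric_reward`, `step_similarity`, `sim_threshold`, and `alpha` are illustrative inventions, the lexical matcher stands in for whatever step scorer the paper uses, and the dual-track self-consistency/reference-alignment scoring is not modeled here. This is a sketch of the general technique, not Search-P1's actual formulation.

```python
from difflib import SequenceMatcher


def step_similarity(a: str, b: str) -> float:
    """Crude lexical similarity between two reasoning steps.

    A stand-in for whatever matcher the paper uses; an embedding- or
    entailment-based scorer could be dropped in here instead.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def path_centric_reward(trajectory_steps, reference_steps,
                        outcome_correct, sim_threshold=0.6, alpha=0.5):
    """Order-agnostic soft reward for one rollout (illustrative).

    Each reference step earns partial credit from its best-matching
    trajectory step regardless of position, so a rollout with a wrong
    final answer can still receive a nonzero learning signal.
    """
    if not reference_steps:
        return 1.0 if outcome_correct else 0.0
    total = 0.0
    for ref in reference_steps:
        best = max((step_similarity(ref, s) for s in trajectory_steps),
                   default=0.0)
        # Soft scoring: keep the similarity value itself, not a hard 0/1 hit.
        total += best if best >= sim_threshold else 0.0
    coverage = total / len(reference_steps)
    outcome = 1.0 if outcome_correct else 0.0
    # Blend the sparse outcome reward with the dense path-coverage term.
    return alpha * outcome + (1.0 - alpha) * coverage


# Example: a rollout with a wrong final answer can still earn partial
# credit for whichever reference steps its searches happen to cover.
reward = path_centric_reward(
    trajectory_steps=["search: who directed Inception",
                      "search: Christopher Nolan birth year"],
    reference_steps=["find the director of Inception",
                     "find the director's birth year",
                     "compute the director's age"],
    outcome_correct=False,
)
print(f"shaped reward: {reward:.3f}")
```

The key design choice this sketch illustrates is that coverage is computed per reference step with a `max` over the whole trajectory, which is what makes the score order-agnostic and lets failed rollouts contribute gradient signal.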