TreePS-RAG: Tree-based Process Supervision for Reinforcement Learning in Agentic RAG

Agentic retrieval-augmented generation (RAG) formulates question answering as a multi-step interaction between reasoning and information retrieval, and has recently been advanced by reinforcement learning (RL) with outcome-based supervision. While effective, relying solely on sparse final rewards limits step-wise credit assignment and provides weak guidance for intermediate reasoning and actions. Recent efforts explore process-level supervision, but typically depend on offline-constructed training data, which risks distribution shift, or require costly intermediate annotations. We present TreePS-RAG, an online, tree-based RL framework for agentic RAG that enables step-wise credit assignment while retaining standard outcome-only rewards. Our key insight is to model agentic RAG reasoning as a rollout tree, where each reasoning step naturally maps to a node. This tree structure allows the utility of each step to be estimated via Monte Carlo estimation over its descendant outcomes, yielding fine-grained process advantages without requiring intermediate labels. To make this paradigm practical, we introduce an efficient online tree construction strategy that preserves exploration diversity under a constrained computational budget. With a rollout cost comparable to strong baselines like Search-R1, experiments on seven multi-hop and general QA benchmarks across multiple model scales show that TreePS-RAG consistently and significantly outperforms both outcome-supervised and leading process-supervised RL methods.
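
To make the Monte Carlo idea concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that a node's value is the mean outcome reward over its descendant leaves and that a step's advantage is its value minus its parent's value; the abstract does not give the exact estimator, so both choices are illustrative. The `Node` class and the functions `leaf_rewards`, `value`, and `step_advantages` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """One reasoning step in a rollout tree (hypothetical structure)."""
    name: str
    reward: Optional[float] = None  # outcome reward, set only on leaf nodes
    children: List["Node"] = field(default_factory=list)


def leaf_rewards(node: Node) -> List[float]:
    """Collect the outcome rewards of all leaves below (or at) this node."""
    if not node.children:
        return [node.reward]
    rewards: List[float] = []
    for child in node.children:
        rewards.extend(leaf_rewards(child))
    return rewards


def value(node: Node) -> float:
    """Monte Carlo value estimate: mean outcome over descendant leaves."""
    rewards = leaf_rewards(node)
    return sum(rewards) / len(rewards)


def step_advantages(root: Node) -> Dict[str, float]:
    """Assumed advantage: value of a step minus the value of its parent."""
    advantages: Dict[str, float] = {}
    stack = [root]
    while stack:
        parent = stack.pop()
        for child in parent.children:
            advantages[child.name] = value(child) - value(parent)
            stack.append(child)
    return advantages


# Toy tree: two first steps, each expanded into two complete rollouts.
# Only the leaves carry an outcome reward (1.0 = correct final answer).
root = Node("root", children=[
    Node("a", children=[Node("a1", reward=1.0), Node("a2", reward=0.0)]),
    Node("b", children=[Node("b1", reward=0.0), Node("b2", reward=0.0)]),
])

adv = step_advantages(root)
for name in sorted(adv):
    print(name, adv[name])
```

In this toy tree, step `a` receives a positive advantage because one of its rollouts succeeds, while step `b` receives a negative one, even though only final outcomes were scored. No intermediate labels are used at any point.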
