ProRAG: Process-Supervised Reinforcement Learning for Retrieval-Augmented Generation

Reinforcement learning (RL) has become a promising paradigm for optimizing Retrieval-Augmented Generation (RAG) in complex reasoning tasks. However, traditional outcome-based RL approaches often suffer from reward sparsity and inefficient credit assignment, as coarse-grained scalar rewards fail to identify specific erroneous steps within long-horizon trajectories. This ambiguity frequently leads to "process hallucinations", where models reach correct answers through flawed logic or redundant retrieval steps. Although recent process-aware approaches attempt to mitigate this via static preference learning or heuristic reward shaping, they often lack the on-policy exploration capabilities required to decouple step-level credit from global outcomes. To address these challenges, we propose ProRAG, a process-supervised reinforcement learning framework designed to integrate learned step-level supervision into the online optimization loop. Our framework consists of four stages: (1) Supervised Policy Warmup to initialize the model with a structured reasoning format; (2) construction of an MCTS-based Process Reward Model (PRM) to quantify intermediate reasoning quality; (3) PRM-Guided Reasoning Refinement to align the policy with fine-grained process preferences; and (4) Process-Supervised Reinforcement Learning with a dual-granularity advantage mechanism. By aggregating step-level process rewards with global outcome signals, ProRAG provides precise feedback for every action. Extensive experiments on five multi-hop reasoning benchmarks demonstrate that ProRAG achieves superior overall performance compared to strong outcome-based and process-aware RL baselines, particularly on complex long-horizon tasks, validating the effectiveness of fine-grained process supervision. The code and model are available at https://github.com/lilinwz/ProRAG.
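
As an illustration of how step-level process rewards might be aggregated with a global outcome signal, the following is a minimal sketch and not the implementation in the ProRAG repository. The abstract does not specify the dual-granularity formula, so everything here is an assumption: the function names `normalize` and `dual_granularity_advantages`, the mixing weight `beta`, the group-wise z-score normalization, and the additive combination are all hypothetical choices made for illustration.

```python
from statistics import mean, pstdev


def normalize(values, eps=1e-8):
    """Z-score a list of floats; eps avoids division by zero (assumed choice)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def dual_granularity_advantages(outcome_rewards, step_rewards, beta=0.5):
    """Combine a trajectory-level and a step-level signal into per-step advantages.

    outcome_rewards: one scalar per sampled trajectory (e.g. answer correctness).
    step_rewards:    one list of PRM scores per trajectory, one score per step.
    beta:            hypothetical weight on the step-level term.
    """
    if len(outcome_rewards) != len(step_rewards):
        raise ValueError("need one list of step rewards per trajectory")

    # Coarse granularity: normalize outcome rewards across the sampled group.
    outcome_adv = normalize(outcome_rewards)

    # Fine granularity: normalize PRM scores over all steps in the group.
    flat_scores = [score for traj in step_rewards for score in traj]
    flat_adv = iter(normalize(flat_scores))

    # Every step receives its trajectory's outcome advantage plus its own
    # weighted process advantage, so steps in one trajectory can differ.
    return [
        [a_out + beta * next(flat_adv) for _ in traj]
        for a_out, traj in zip(outcome_adv, step_rewards)
    ]


if __name__ == "__main__":
    # Two sampled trajectories: the first answers correctly, the second does not.
    advantages = dual_granularity_advantages(
        outcome_rewards=[1.0, 0.0],
        step_rewards=[[0.9, 0.2], [0.4, 0.5]],
    )
    for i, traj_adv in enumerate(advantages):
        print(i, [round(a, 3) for a in traj_adv])
```

Under these assumptions, a weak step inside a correct trajectory receives a smaller advantage than a strong step in the same trajectory, which is the kind of step-level credit an outcome-only scalar reward cannot express.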
