Reinforcement learning has emerged as an effective paradigm for training large language models to interleave reasoning with search engine calls. However, existing approaches face a fundamental credit assignment problem: methods like Search-R1 assign a single outcome reward to the entire multi-step trajectory, providing no signal about which reasoning or retrieval decisions were responsible for success or failure. Process-reward methods such as StepSearch introduce step-level supervision but still sample complete trajectories independently, so advantage estimates at any given step are contaminated by the randomness of all other steps. We propose SLATE (Step-Level Advantage estimation for Truncated Exploration), which addresses both problems through two complementary ideas. First, truncated step-level sampling generates k continuations from a shared prefix, isolating all variation to a single decision point. We prove that, for T-step trajectories, this reduces the variance of advantage estimates by up to a factor of T relative to full-trajectory sampling, the first formal variance guarantee for step-level RL in retrieval-augmented reasoning. Second, dense, decomposed process rewards use an LLM judge to separately score reasoning quality, query quality, and answer correctness on a ternary scale, providing richer supervision than binary outcome signals or heuristic step-level scores. Experiments on seven QA benchmarks show that SLATE consistently outperforms both sparse-reward and process-reward baselines, achieving a 7.0% relative improvement over Search-R1 on the 7B model and 30.7% on the 3B model. Gains are largest on challenging multi-hop tasks, and ablations confirm that truncated sampling and dense rewards provide complementary benefits.
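To make the variance claim concrete, here is a back-of-the-envelope version of the comparison, not the paper's formal theorem: assume each step contributes independent noise of variance \(\sigma^2\) to the reward signal, and that advantages are estimated as Monte Carlo means over k samples. Under full-trajectory sampling, the return used at step t aggregates noise from all T steps; under truncated sampling, siblings share the prefix, so only step t varies.

```latex
\[
\underbrace{\operatorname{Var}\!\big[\hat{A}^{\text{full}}_t\big] \approx \frac{T\sigma^2}{k}}_{\text{$k$ independent full trajectories: noise from all $T$ steps}}
\qquad\text{vs.}\qquad
\underbrace{\operatorname{Var}\!\big[\hat{A}^{\text{trunc}}_t\big] \approx \frac{\sigma^2}{k}}_{\text{$k$ continuations of a shared prefix: only step $t$ varies}}
\]
```

The ratio of the two is T, matching the "up to a factor of T" reduction stated above; the paper's actual theorem and its conditions may be stated differently.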
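A minimal sketch of the sampling scheme itself may also help. Everything below is hypothetical scaffolding (policy.generate_step, the judge interface, and the unweighted reward sum are all our assumptions, not SLATE's actual implementation), chosen only to illustrate branching k continuations from a shared prefix and computing a baseline-subtracted advantage at that single decision point.

```python
import statistics

def truncated_step_advantages(policy, judge, prefix, k=8):
    """Sample k continuations of the NEXT step from a shared prefix and
    score each one, so all variation is isolated to this decision point.

    `policy.generate_step` and `judge` are hypothetical stand-ins for the
    model's one-step rollout and the LLM judge described in the abstract.
    """
    steps, rewards = [], []
    for _ in range(k):
        step = policy.generate_step(prefix)  # one reasoning/search step
        # Decomposed ternary scores, e.g. in {-1, 0, 1}:
        # reasoning quality, query quality, answer correctness.
        r_reason, r_query, r_answer = judge(prefix, step)
        rewards.append(r_reason + r_query + r_answer)  # illustrative sum
        steps.append(step)
    baseline = statistics.mean(rewards)  # shared-prefix group baseline
    # Advantage of each sibling relative to the group mean; the other
    # steps contribute no noise because the prefix is held fixed.
    return [(s, r - baseline) for s, r in zip(steps, rewards)]
```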