Training large language models to reason with search engines via reinforcement learning is hindered by a fundamental credit assignment problem: existing methods such as Search-R1 provide only a sparse outcome reward after an entire multi-step trajectory, making it infeasible to attribute success or failure to individual reasoning and retrieval decisions. Process-reward methods such as StepSearch alleviate this by introducing step-level supervision, but they rely on heuristic rewards such as TF-IDF overlap with gold documents and still sample $k$ complete trajectories per example, so gradient variance remains high. We propose SLATE, a framework built on two complementary ideas: (1) truncated step-level sampling, which generates $k$ trajectories that share a common prefix and differ only at the next step, isolating variation to a single decision point; and (2) dense, decomposed LLM-as-judge rewards, which score each reasoning step, search query, and answer on a ternary scale along separate quality dimensions, providing richer supervision than binary outcome signals or undifferentiated step-level judgments. We prove that, under the same dense reward structure, truncated sampling reduces the variance of advantage estimates by up to a factor of $T$ relative to full-trajectory sampling for $T$-step trajectories, yielding lower-variance and better-targeted policy gradients. Experiments on seven QA benchmarks confirm that SLATE consistently outperforms both sparse-reward and process-reward baselines, with the largest gains on harder multi-hop tasks and on smaller models.
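The variance claim can be illustrated with a toy simulation. The sketch below is not the paper's proof or implementation: it assumes step rewards are i.i.d. draws from a ternary scale with hypothetical values $\{-1, 0, 1\}$, that a trajectory's return is the sum of its step rewards, and that advantages are computed against a group-mean baseline over $k$ samples. All function names are illustrative. Under these assumptions, full-trajectory sampling lets all $T$ steps vary, while truncated sampling varies only the next step, so the ratio of advantage variances should be roughly $T$.

```python
import random
import statistics

# Assumed numeric values for the ternary judge scale (hypothetical).
TERNARY = (-1.0, 0.0, 1.0)


def step_reward(rng):
    """Toy stand-in for a judge score of one step: i.i.d. and uniform."""
    return rng.choice(TERNARY)


def full_trajectory_advantages(rng, k, T):
    """Sample k complete T-step trajectories; advantage = return - group mean."""
    returns = [sum(step_reward(rng) for _ in range(T)) for _ in range(k)]
    baseline = statistics.fmean(returns)
    return [ret - baseline for ret in returns]


def truncated_advantages(rng, k):
    """Share the prefix and sample k candidates for the next step only.

    The shared prefix contributes the same amount to every candidate, so it
    cancels against the group-mean baseline and is omitted here.
    """
    rewards = [step_reward(rng) for _ in range(k)]
    baseline = statistics.fmean(rewards)
    return [r - baseline for r in rewards]


def empirical_variance(sampler, groups):
    """Pool advantages from many independent groups and return their variance."""
    values = []
    for _ in range(groups):
        values.extend(sampler())
    mean = statistics.fmean(values)
    return statistics.fmean([(v - mean) ** 2 for v in values])


def variance_ratio(T=8, k=8, groups=5000, seed=0):
    """Ratio of advantage variance: full-trajectory over truncated sampling."""
    rng = random.Random(seed)
    var_full = empirical_variance(
        lambda: full_trajectory_advantages(rng, k, T), groups
    )
    var_trunc = empirical_variance(lambda: truncated_advantages(rng, k), groups)
    return var_full / var_trunc


if __name__ == "__main__":
    for T in (2, 4, 8):
        print(f"T={T}: variance ratio ~ {variance_ratio(T=T):.2f}")
```

The independence assumption is what makes the ratio land near $T$ here; with correlated step rewards the reduction would generally differ, which is consistent with the abstract's "up to a factor of $T$" wording.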