Reinforcement Learning with Verifiable Rewards (RLVR), particularly with algorithms like Group Relative Policy Optimization (GRPO), has proven highly effective in enhancing the reasoning capabilities of large language models. However, a critical bottleneck in current pipelines lies in the limited diversity of trajectories sampled during group rollouts. Homogeneous trajectories and their associated rewards diminish the return signals for policy updates, thereby hindering effective policy learning. This lack of diversity stems primarily from token-level stochastic sampling, where local variations tend to collapse into near-identical reasoning paths. To address this limitation, we propose Lookahead Tree-Based Rollouts (LATR), a novel rollout strategy designed to explicitly promote trajectory-level diversity by enforcing branching into candidate tokens that are likely to yield distinct continuations. Specifically, LATR operates iteratively in three stages: (1) branching at high-uncertainty generation steps, (2) performing lookahead simulation for each new branch, and (3) pruning branches that exhibit prolonged similarity during simulation. Compared with stochastic sampling, LATR accelerates policy learning by 131% on average and improves final pass@1 performance by 4.2% with both the GRPO and Dynamic sAmpling Policy Optimization (DAPO) algorithms across different reasoning tasks. Our code and data are publicly available at https://github.com/starreeze/latr.
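To make the three-stage loop concrete, the following is a minimal, illustrative Python sketch of a lookahead tree-based rollout, not the authors' implementation. The stub functions (`sample_next`, `top_k_candidates`, `token_entropy`, `similarity`) and all thresholds are hypothetical placeholders standing in for policy-model calls and a trajectory-similarity metric.

```python
# Conceptual sketch of a lookahead tree-based rollout (hypothetical, for illustration only).
import random

def sample_next(prefix):            # hypothetical stand-in: sample one token from the policy
    return random.choice("abcd")

def top_k_candidates(prefix, k=2):  # hypothetical stand-in: k most likely next tokens
    return random.sample("abcd", k)

def token_entropy(prefix):          # hypothetical stand-in: next-token distribution uncertainty
    return random.random()

def similarity(a, b):               # hypothetical stand-in: similarity of two continuations
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b), 1)

def latr_rollout(prompt, max_len=32, entropy_thresh=0.8,
                 lookahead=4, sim_thresh=0.9, max_branches=8):
    """One rollout that branches at uncertain steps and prunes similar branches."""
    branches = [prompt]
    for _ in range(max_len):
        new_branches = []
        for seq in branches:
            # (1) branch at high-uncertainty generation steps; otherwise sample one token
            if token_entropy(seq) > entropy_thresh and len(branches) < max_branches:
                candidates = top_k_candidates(seq)
            else:
                candidates = [sample_next(seq)]
            kept_lookaheads = []
            for tok in candidates:
                # (2) lookahead simulation: roll each candidate a few steps forward
                cont = seq + tok
                for _ in range(lookahead):
                    cont += sample_next(cont)
                # (3) prune candidates whose lookahead stays too similar to a kept branch
                if all(similarity(cont, other) < sim_thresh for other in kept_lookaheads):
                    kept_lookaheads.append(cont)
                    new_branches.append(seq + tok)  # commit only the branching token
        branches = new_branches or branches
    return branches

print(latr_rollout("1+1="))
```

In this sketch, diversity is enforced at the point of branching (distinct candidate tokens) and verified by short lookahead simulations before a branch is committed, mirroring the three stages described above at a conceptual level.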