Reinforcement Learning with Verifiable Rewards (RLVR), particularly with algorithms such as Group Relative Policy Optimization (GRPO), has proven highly effective at enhancing the reasoning capabilities of large language models. However, a critical bottleneck in current pipelines lies in the limited diversity of the trajectories sampled during group rollouts. Homogeneous trajectories and their associated rewards diminish the return signals for policy updates, thereby hindering effective policy learning. This lack of diversity stems primarily from token-level stochastic sampling, where local variations tend to collapse into near-identical reasoning paths. To address this limitation, we propose Lookahead Tree-Based Rollouts (LATR), a novel rollout strategy designed to explicitly promote trajectory-level diversity by enforcing branching into candidate tokens that are likely to yield distinct continuations. Specifically, LATR iterates over three stages: (1) branching at high-uncertainty generation steps, (2) performing lookahead simulation for each new branch, and (3) pruning branches that exhibit prolonged similarity during simulation. Compared with stochastic sampling, LATR accelerates policy learning by 131% on average and improves final pass@1 performance by 4.2% under both the GRPO and Dynamic sAmpling Policy Optimization (DAPO) algorithms across different reasoning tasks. Our code and data are publicly available at https://github.com/starreeze/latr.
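To make the three-stage loop concrete, the sketch below illustrates one way such a rollout could be structured. It is a minimal, self-contained toy: the thresholds, the mock next-token distribution, and the helper names (next_token_distribution, lookahead, similarity) are illustrative assumptions, not the authors' actual implementation (see the repository above for that).

```python
# Minimal sketch of a LATR-style rollout loop, assuming a toy next-token
# distribution in place of a real policy model. All constants and helpers
# here are hypothetical placeholders chosen for illustration.
import math
import random
from difflib import SequenceMatcher

ENTROPY_THRESHOLD = 1.0   # assumed: branch when next-token entropy is high
LOOKAHEAD_STEPS = 8       # assumed length of the lookahead simulation
SIM_THRESHOLD = 0.9       # assumed: prune branches whose lookaheads stay this similar
MAX_BRANCHES = 8          # assumed rollout group size

def next_token_distribution(prefix):
    """Placeholder for the policy's next-token distribution (token -> prob)."""
    vocab = ["a", "b", "c", "d"]
    probs = [random.random() for _ in vocab]
    z = sum(probs)
    return {t: p / z for t, p in zip(vocab, probs)}

def entropy(dist):
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def lookahead(prefix, steps=LOOKAHEAD_STEPS):
    """Short greedy continuation used only to compare candidate branches."""
    seq = list(prefix)
    for _ in range(steps):
        dist = next_token_distribution(seq)
        seq.append(max(dist, key=dist.get))
    return seq[len(prefix):]

def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

def latr_rollout(prompt, max_len=64):
    branches = [list(prompt)]
    for _ in range(max_len):
        new_branches = []
        for seq in branches:
            dist = next_token_distribution(seq)
            if entropy(dist) > ENTROPY_THRESHOLD and len(branches) < MAX_BRANCHES:
                # (1) branch into the top candidate tokens at a high-uncertainty step
                candidates = sorted(dist, key=dist.get, reverse=True)[:2]
            else:
                candidates = [max(dist, key=dist.get)]
            new_branches.extend(seq + [t] for t in candidates)

        # (2) lookahead simulation for each branch, (3) prune near-duplicates
        sims = [lookahead(seq) for seq in new_branches]
        kept = []
        for i, seq in enumerate(new_branches):
            if all(similarity(sims[i], sims[j]) < SIM_THRESHOLD for j in kept):
                kept.append(i)
        branches = [new_branches[i] for i in kept][:MAX_BRANCHES]
    return branches

if __name__ == "__main__":
    for traj in latr_rollout(["<bos>"]):
        print("".join(traj[1:]))
```

In this sketch, branching is triggered by next-token entropy, lookahead continuations serve only to compare branches, and near-duplicate branches are dropped before the next generation step; the real system applies the same loop with a language-model policy and task-specific hyperparameters.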