Reinforcement learning with verifiable rewards (RLVR) has recently enhanced the reasoning capabilities of large language models (LLMs), particularly for mathematical problem solving. However, a fundamental limitation remains: as the sampling budget increases, the advantage of RLVR-trained models over their pretrained base models often diminishes or even vanishes, revealing a strong dependence on the base model's restricted search space. We attribute this phenomenon to the widespread use of the reverse Kullback-Leibler (KL) divergence regularizer, whose mode-seeking behavior keeps the policy trapped inside the base model's support region and hampers wider exploration. To address this issue, we propose RAPO (Rewards-Aware Policy Optimization), an algorithm that promotes broader yet focused exploration. Our method (i) replaces the reverse KL penalty with a forward KL penalty to enable out-of-distribution exploration, and (ii) reweights the reference policy to facilitate adaptive in-distribution exploration. We train Qwen2.5-3B and 7B models with RAPO on the 8K SimpleRL-Zero dataset, without supervised fine-tuning, and evaluate them on AIME2024 and AIME2025. Results show that RAPO consistently improves problem-solving performance. Notably, RAPO enables models to surpass the base model's performance ceiling and to solve previously intractable problems, advancing the frontier of RLVR for challenging reasoning tasks.
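For concreteness, the two regularizers differ only in the direction of the divergence. The textbook definitions below (with policy $\pi_\theta$, reference/base model $\pi_{\mathrm{ref}}$, prompt $x$, and response $y$) illustrate why the reverse form is mode-seeking and the forward form is mass-covering; they are a general sketch, not RAPO's exact training objective, whose weighting and estimation details are given in the method section:

$$D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[\log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}\right] \quad \text{(reverse KL, mode-seeking)},$$

$$D_{\mathrm{KL}}\!\left(\pi_{\mathrm{ref}} \,\|\, \pi_\theta\right) = \mathbb{E}_{y \sim \pi_{\mathrm{ref}}(\cdot \mid x)}\!\left[\log \frac{\pi_{\mathrm{ref}}(y \mid x)}{\pi_\theta(y \mid x)}\right] \quad \text{(forward KL, mass-covering)}.$$

The reverse KL penalty heavily penalizes the policy for placing probability where the base model has little, which confines exploration to the base model's support; the forward KL penalty instead penalizes the policy for failing to cover regions the reference covers, leaving room to place mass outside that support.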