Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a key paradigm for post-training Large Language Models (LLMs), particularly for complex reasoning tasks. However, vanilla RLVR training has been shown to improve Pass@1 performance at the expense of policy entropy, leading to reduced generation diversity and limiting Pass@k performance, which typically represents the upper bound of LLM reasoning capability. In this paper, we systematically analyze the policy's generation diversity from the perspective of training problems and find that augmenting and updating training problems helps mitigate entropy collapse during training. Based on these observations, we propose an online Self-play with Variational problem Synthesis (SvS) strategy for RLVR training, which uses the policy's correct solutions to synthesize variational problems while ensuring their reference answers remain identical to the originals. This self-improving strategy effectively maintains policy entropy during training and substantially improves Pass@k compared with standard RLVR, sustaining prolonged improvements and achieving absolute gains of 18.3% and 22.8% in Pass@32 on the competition-level AIME24 and AIME25 benchmarks, respectively, with gains also observed on code generation tasks. Experiments on 12 reasoning benchmarks across model sizes from 3B to 32B consistently demonstrate the generalizability and robustness of SvS.
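To make the SvS loop concrete, below is a minimal sketch of one training step under stated assumptions: the helper names (`policy_generate`, `verify`, `extract_answer`) are hypothetical stand-ins for the actual policy sampling and rule-based verification, not the paper's implementation. It only illustrates the idea of synthesizing a variational problem from a verified correct solution while reusing the original reference answer.

```python
import random

# Hypothetical stand-ins for the policy model and the verifier; names and
# signatures are assumptions for illustration only.
def policy_generate(prompt: str, n: int = 8) -> list[str]:
    """Sample n responses from the current policy (stubbed here)."""
    return [f"response_{i} to: {prompt}" for i in range(n)]

def verify(answer: str, reference: str) -> bool:
    """Rule-based verifiable reward: exact-match check against the reference."""
    return answer.strip() == reference.strip()

def extract_answer(response: str) -> str:
    """Pull the final answer out of a generated solution (stubbed here)."""
    return response.split()[-1]

def svs_step(problem: str, reference: str, pool: list[tuple[str, str]]) -> None:
    """One SvS-style step: solve the problem, then use a correct solution
    to synthesize a variational problem whose reference answer is unchanged."""
    solutions = policy_generate(f"Solve: {problem}")
    correct = [s for s in solutions if verify(extract_answer(s), reference)]
    if not correct:
        return  # no verified solution, so nothing to synthesize from
    seed_solution = random.choice(correct)
    # Ask the policy itself to rewrite the problem conditioned on its own
    # correct solution, keeping the original answer as the reference.
    synthesis_prompt = (
        "Rewrite the following problem into a new variation that has the "
        f"same final answer.\nProblem: {problem}\nSolution: {seed_solution}"
    )
    variational_problem = policy_generate(synthesis_prompt, n=1)[0]
    # The synthesized problem joins the training pool with the original
    # reference answer, so its reward remains verifiable.
    pool.append((variational_problem, reference))
```

In actual RLVR training, the synthesized problems would be folded back into the online training pool, which corresponds to the "augmenting and updating training problems" that the abstract identifies as mitigating entropy collapse.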