We show that reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizing the mathematical reasoning capabilities of large language models (LLMs). Applying RLVR to the base model Qwen2.5-Math-1.5B, we identify a single example that elevates model performance on MATH500 from 36.0% to 73.6% (an 8.6% improvement beyond format correction) and improves the average performance across six common mathematical reasoning benchmarks from 17.6% to 35.7% (a 7.0% non-format gain). This result matches the performance obtained with the 1.2k-example DeepScaleR subset (MATH500: 73.6%, average: 35.9%), which contains the aforementioned example. Furthermore, RLVR with only two examples even slightly exceeds these results (MATH500: 74.8%, average: 36.6%). Similar substantial improvements are observed across various models (Qwen2.5-Math-7B, Llama3.2-3B-Instruct, DeepSeek-R1-Distill-Qwen-1.5B), RL algorithms (GRPO and PPO), and different math examples. In addition, we identify several interesting phenomena during 1-shot RLVR, including cross-category generalization, an increased frequency of self-reflection, and sustained test-performance improvement even after training accuracy has saturated, a phenomenon we term post-saturation generalization. Moreover, we verify that the effectiveness of 1-shot RLVR arises primarily from the policy gradient loss, distinguishing it from the "grokking" phenomenon. We also show the critical role of promoting exploration (e.g., by incorporating an entropy loss with an appropriate coefficient) in 1-shot RLVR training, and we further discuss related observations about format correction, label robustness, and prompt modification. These findings can inspire future work on RLVR efficiency and encourage a re-examination of both recent progress and the underlying mechanisms of RLVR. All resources are open source at https://github.com/ypwang61/One-Shot-RLVR.
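For context, the following is a minimal sketch of the kind of objective referenced above: a GRPO-style clipped policy gradient loss augmented with an entropy bonus to promote exploration. The group size $G$, clipping range $\varepsilon$, and entropy coefficient $\alpha$ are illustrative placeholders rather than the exact configuration used in the experiments, and a KL regularization term toward a reference policy (present in some GRPO variants) is omitted for brevity.

\[
\mathcal{L}(\theta) \;=\; -\,\mathbb{E}_{q,\,\{o_i\}_{i=1}^{G}\sim \pi_{\theta_{\mathrm{old}}}}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\min\!\Big(\rho_{i,t}\,\hat{A}_{i},\;\mathrm{clip}\big(\rho_{i,t},\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}_{i}\Big)\right]\;-\;\alpha\,\mathcal{H}\big[\pi_\theta\big],
\]

where $\rho_{i,t} = \pi_\theta(o_{i,t}\mid q, o_{i,<t})\,/\,\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q, o_{i,<t})$ is the token-level importance ratio, $\hat{A}_{i} = \big(r_i - \mathrm{mean}(\{r_j\}_{j=1}^{G})\big)/\mathrm{std}(\{r_j\}_{j=1}^{G})$ is the group-normalized advantage computed from the verifiable (rule-based) rewards $r_i$, and $\mathcal{H}[\pi_\theta]$ denotes the entropy of the policy over generated tokens.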