Can language models improve their reasoning performance without external rewards, training only on their own sampled responses? We show that they can. We propose Self-evolving Post-Training (SePT), a simple post-training method that alternates between self-generation and training on the self-generated responses. It repeatedly samples questions, generates low-temperature responses with the model itself, and then fine-tunes the model on that self-generated data. The self-training loop uses online data refresh: each new batch is generated by the most recently updated model. Across six math reasoning benchmarks, SePT improves over a strong no-training baseline, defined as the untuned base model evaluated at its best swept decoding temperature, on several of the tested models. In some settings, SePT even approaches the performance of Reinforcement Learning with Verifiable Rewards (RLVR). Additional ablations demonstrate the importance of online data refresh and temperature decoupling. Overall, our results identify a practical regime in which reasoning can be improved using self-generated supervision alone. Our code is available at https://github.com/ElementQi/SePT.
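The loop described in the abstract (sample at low temperature, fine-tune on the samples, regenerate from the updated model) can be illustrated with a toy sketch. This is not the authors' implementation: the "model" here is just a vector of logits over a few candidate answers, and the names `softmax` and `sept_toy`, the sampling temperature of 0.5, the learning rate, and the batch size are all hypothetical choices made for illustration. The loss is assumed to be ordinary cross-entropy at temperature 1.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def sept_toy(logits, steps=200, batch_size=8,
             sample_temperature=0.5, lr=0.1, seed=0):
    """Toy self-training loop on a categorical 'model' (illustrative only).

    Each step:
      1. samples a batch from the CURRENT model at a low temperature
         (the online data refresh),
      2. takes one gradient step on the cross-entropy of those samples,
         computed at temperature 1 (an assumption of this sketch).
    No external reward or correctness label is used anywhere.
    """
    rng = random.Random(seed)
    logits = list(logits)
    n = len(logits)
    for _ in range(steps):
        # 1. Self-generation from the most recently updated model.
        sample_probs = softmax(logits, sample_temperature)
        batch = rng.choices(range(n), weights=sample_probs, k=batch_size)

        # 2. Supervised step: gradient of mean NLL w.r.t. logits is
        #    (model probabilities) - (empirical frequencies in the batch).
        train_probs = softmax(logits, 1.0)
        freqs = [0.0] * n
        for y in batch:
            freqs[y] += 1.0 / batch_size
        grad = [p - f for p, f in zip(train_probs, freqs)]
        logits = [l - lr * g for l, g in zip(logits, grad)]
    return logits


if __name__ == "__main__":
    start = [1.0, 0.5, 0.0]
    final = sept_toy(start)
    print("initial probabilities:", softmax(start))
    print("final probabilities:  ", softmax(final))
```

In this toy setting the loop concentrates probability on the answer the model already preferred, because low-temperature samples over-represent the mode relative to the temperature-1 distribution being trained. Whether and when that sharpening translates into better reasoning on real benchmarks is the empirical question the paper studies; the sketch only shows the mechanics of the loop.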