We present a simple, self-help online supervised finetuning (OSFT) paradigm for LLM reasoning. In this paradigm, the model generates its own responses and is immediately finetuned on this self-generated data. OSFT is a highly efficient training strategy for LLM reasoning: it is reward-free and uses just one rollout by default. Experimental results show that OSFT achieves downstream performance on challenging mathematical reasoning tasks comparable to that of strong reinforcement learning with verifiable rewards (RLVR) methods such as GRPO. Our ablation study further demonstrates the efficiency and robustness of OSFT. The major mechanism of OSFT lies in facilitating the model's own existing preference (latent knowledge) learned from pretraining, which leads to improved reasoning ability. We believe that OSFT offers an efficient and promising alternative to more complex, reward-based training paradigms. Our code is available at https://github.com/ElementQi/OnlineSFT.
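To make the paradigm concrete, the following is a minimal sketch of a single OSFT step, assuming a Hugging Face-style causal LM and tokenizer; the function name, hyperparameters, and loss masking shown here are illustrative assumptions, not the repository's actual implementation.

```python
import torch

def osft_step(model, tokenizer, optimizer, prompt, max_new_tokens=512):
    """One illustrative OSFT step: generate a single rollout, then finetune on it."""
    model.eval()
    # 1) Rollout: the model generates its own response to the prompt (one rollout, no reward).
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        generated = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True)

    # 2) Finetune: standard next-token cross-entropy on the self-generated sequence,
    #    masking the prompt tokens so the loss covers only the generated response.
    model.train()
    labels = generated.clone()
    labels[:, : inputs["input_ids"].shape[1]] = -100  # ignore prompt positions in the loss
    loss = model(input_ids=generated, labels=labels).loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```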