Supervised fine-tuning (SFT) is computationally efficient but often generalizes worse than reinforcement learning (RL), a gap driven primarily by RL's use of on-policy data. We propose a framework that bridges this gap by enabling on-policy SFT. We first present \textbf{\textit{Distribution Discriminant Theory (DDT)}}, which explains and quantifies the alignment between training data and the model-induced distribution. Leveraging DDT, we introduce two complementary techniques: (i) \textbf{\textit{In-Distribution Finetuning (IDFT)}}, a loss-level method that enhances the generalization ability of SFT, and (ii) \textbf{\textit{Hinted Decoding}}, a data-level technique that re-aligns the training corpus with the model's distribution. Extensive experiments demonstrate that our framework matches the generalization performance of prominent offline RL algorithms, including DPO and SimPO, while retaining the efficiency of an SFT pipeline. The proposed framework thus offers a practical alternative in domains where RL is infeasible. Our code is available at: https://github.com/zhangmiaosen2000/Towards-On-Policy-SFT