Supervised Fine-Tuning (SFT) and Preference Optimization (PO) are two fundamental processes for enhancing the capabilities of Language Models (LMs) after pre-training and aligning them better with human preferences. SFT is more efficient to train, while PO delivers better alignment, so the two are often combined. However, common practice simply applies them sequentially without integrating their optimization objectives, missing the opportunity to bridge the gap between the two paradigms and take the strengths of both. To obtain a unified understanding, we interpret SFT and PO in terms of two sub-processes, Preference Estimation and Transition Optimization, defined at the token level within the Markov Decision Process (MDP) framework. This modeling shows that SFT is only a special case of PO with inferior estimation and optimization. PO evaluates the quality of the model's entire generated answer, whereas SFT only scores predicted tokens conditioned on the preceding tokens of the target answer. SFT therefore overestimates the model's ability, leading to inferior optimization. Building on this view, we introduce Intuitive Fine-Tuning (IFT) to integrate SFT and Preference Optimization into a single process. IFT captures the LM's intuitive sense of the entire answer through a temporal residual connection, yet it relies on only a single policy and the same volume of non-preference-labeled data as SFT. Our experiments show that IFT performs comparably to, or even better than, sequential recipes of SFT and some typical Preference Optimization methods across several tasks, particularly those that require generation, reasoning, and fact-following abilities. An explainable Frozen Lake game further validates the effectiveness of IFT in obtaining a competitive policy.
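The claim that SFT overestimates the model's ability can be illustrated with a toy sketch. This is not the IFT method or the paper's formulation. It only contrasts scoring each token from the ground-truth prefix (as SFT does) with scoring an answer the model generates from its own outputs. The lookup table `GREEDY_NEXT`, the sequence `TARGET`, and all function names are hypothetical and chosen purely for illustration.

```python
# Toy next-token "model": maps the previous token to its greedy prediction.
# All entries are hypothetical values for illustration only.
GREEDY_NEXT = {
    "<s>": "the",
    "the": "cat",
    "cat": "ran",   # the model's one mistake: the target continues with "sat"
    "sat": "down",
    "ran": "away",
    "away": "fast",
}

TARGET = ["the", "cat", "sat", "down"]


def teacher_forced_predictions(target):
    """Predict each token from the ground-truth prefix (SFT-style scoring)."""
    prev = "<s>"
    preds = []
    for gold in target:
        preds.append(GREEDY_NEXT[prev])
        prev = gold  # condition on the target token, not the model's own output
    return preds


def free_running_predictions(length):
    """Predict each token from the model's own previous output."""
    prev = "<s>"
    preds = []
    for _ in range(length):
        tok = GREEDY_NEXT[prev]
        preds.append(tok)
        prev = tok  # errors propagate into later steps
    return preds


def token_accuracy(preds, target):
    """Fraction of positions where the prediction matches the target."""
    return sum(p == g for p, g in zip(preds, target)) / len(target)


tf_preds = teacher_forced_predictions(TARGET)
fr_preds = free_running_predictions(len(TARGET))
print("teacher-forced:", tf_preds, token_accuracy(tf_preds, TARGET))
print("free-running:  ", fr_preds, token_accuracy(fr_preds, TARGET))
```

In this toy setting, the teacher-forced pass recovers after the single mistake because the next step is conditioned on the correct target token, whereas the free-running answer stays off track once it diverges. Token-level scores under teacher forcing therefore look better than the quality of the answer the model would actually produce.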