Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) are two fundamental processes for enhancing the capabilities of Language Models (LMs) after pre-training and aligning them more closely with human preferences. SFT is more efficient to train, whereas RLHF delivers better alignment, so the two are often combined. However, common practice simply applies them sequentially without unifying their optimization targets, which forces a trade-off between fitting different objectives and misses the opportunity to bridge the paradigm gap and draw on the strengths of both. To obtain a unified understanding, we interpret SFT and RLHF through two sub-processes, Preference Estimation and Transition Optimization, defined at the token level within the Markov Decision Process (MDP) framework. This modeling shows that SFT is only a special case of RLHF with inferior estimation and optimization. RLHF evaluates the quality of the model's entire generated answer, whereas SFT only scores predicted tokens based on the preceding tokens of the target answer. SFT therefore overestimates the model's ability, leading to inferior optimization. Building on this view, we introduce Intuitive Fine-tuning (IFT) to integrate SFT and RLHF into a single process. IFT captures the LM's intuitive sense of the entire answer through a temporal residual connection, while using a single policy and the same volume of non-preference-labeled data as SFT. Our experiments show that IFT performs comparably to, or even better than, sequential recipes of SFT and some typical alignment methods across several tasks, particularly those that require generation, reasoning, and fact-following abilities. An explainable Frozen Lake game further validates the effectiveness of IFT.
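The claim that SFT scores each predicted token given the preceding tokens of the target answer, while RLHF judges the model's entire generated answer, can be illustrated with a toy example. The sketch below is not IFT and is not taken from the paper: the bigram policy `POLICY`, the target sequence, and all helper names are hypothetical, chosen only to show how token-level scoring on gold prefixes can look better than the model's own rollout.

```python
# Toy next-token policy: previous token -> distribution over the next token.
# All tokens and probabilities are hypothetical, picked only for illustration.
POLICY = {
    "<s>": {"a": 0.9, "x": 0.1},
    "a": {"x": 0.6, "b": 0.4},   # the model's one mistake: prefers "x" after "a"
    "b": {"c": 0.9, "x": 0.1},
    "c": {"d": 0.9, "x": 0.1},
    "x": {"x": 0.9, "d": 0.1},
}
TARGET = ["a", "b", "c", "d"]


def greedy_next(prev):
    """Return the most probable next token after `prev`."""
    dist = POLICY[prev]
    return max(dist, key=dist.get)


def teacher_forced_predictions(target):
    """Predict each token from the gold prefix (SFT-style token scoring)."""
    gold_prefix = ["<s>"] + target[:-1]
    return [greedy_next(prev) for prev in gold_prefix]


def free_running_rollout(length):
    """Predict each token from the model's own previous output."""
    out, prev = [], "<s>"
    for _ in range(length):
        prev = greedy_next(prev)
        out.append(prev)
    return out


def token_accuracy(pred, target):
    """Fraction of positions where the prediction matches the target."""
    return sum(p == t for p, t in zip(pred, target)) / len(target)


tf_pred = teacher_forced_predictions(TARGET)
fr_pred = free_running_rollout(len(TARGET))
tf_acc = token_accuracy(tf_pred, TARGET)
fr_acc = token_accuracy(fr_pred, TARGET)

print("teacher-forced:", tf_pred, tf_acc)  # ['a', 'x', 'c', 'd'] 0.75
print("free-running:  ", fr_pred, fr_acc)  # ['a', 'x', 'x', 'x'] 0.25
```

In this toy setting a single early mistake is hidden when every step is conditioned on the gold prefix, but it derails the whole answer once the model conditions on its own output. That gap is one way to read the statement that SFT overestimates the model's ability.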