Continual post-training (CPT) is a popular and effective technique for adapting foundation models, such as multimodal large language models, to specific and ever-evolving downstream tasks. While existing research has primarily concentrated on methods like data replay, model expansion, or parameter regularization, the fundamental role of the learning paradigm within CPT remains largely unexplored. This paper presents a comparative analysis of two core post-training paradigms: supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), investigating their respective impacts on knowledge retention during CPT. Our experiments are conducted on a benchmark comprising seven diverse multimodal tasks, using Qwen2.5-VL-7B-Instruct as the base model for continual post-training. The investigation yields two significant findings: (1) When continually learning on downstream tasks, SFT leads to catastrophic forgetting of previously learned tasks, whereas RFT inherently preserves prior knowledge and achieves performance comparable to multi-task training. (2) RFT protects and even enhances the model's general knowledge on standard benchmarks (e.g., MMMU and MMLU-Pro), whereas SFT severely degrades the model's general capabilities. Further analysis reveals that this stability is not primarily due to explicit mechanisms such as the KL penalty or chain-of-thought reasoning; instead, we identify an implicit regularization mechanism inherent to RFT as a key contributing factor. Our theoretical analysis suggests that RFT's gradient updates are naturally scaled by the reward variance, acting as a data-dependent regularizer that inherently protects previously acquired knowledge. Finally, we propose a rollout-based instance filtering algorithm to enhance the stability and efficiency of RFT. Our comprehensive study demonstrates the superiority of RFT as a robust paradigm for continual post-training.
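To make the reward-variance scaling argument concrete, the following is a minimal sketch, not the paper's implementation, assuming a GRPO-style group-relative advantage (one common RFT formulation); the function name compute_advantages is hypothetical. When every rollout for an instance receives the same reward, the reward variance is zero, the advantages vanish, and that instance contributes essentially no gradient, which is the data-dependent regularization effect described above.

```python
# Illustrative sketch only: how reward variance scales the policy-gradient
# signal under a GRPO-style group-relative advantage. Assumed, not the
# paper's code; compute_advantages is a hypothetical helper.
import numpy as np

def compute_advantages(group_rewards, eps=1e-6):
    """Group-relative advantages: zero reward variance => near-zero advantages,
    so the instance produces essentially no update and prior knowledge is untouched."""
    rewards = np.asarray(group_rewards, dtype=np.float64)
    centered = rewards - rewards.mean()
    return centered / (rewards.std() + eps)

# An instance the model already solves (or always fails) yields no learning signal:
print(compute_advantages([1.0, 1.0, 1.0, 1.0]))  # ~[0, 0, 0, 0] -> no gradient
# A partially solved instance produces a variance-scaled signal:
print(compute_advantages([1.0, 0.0, 1.0, 0.0]))  # ~[1, -1, 1, -1] -> informative update
```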
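The abstract does not spell out the filtering criterion, so the sketch below is one plausible reading rather than the paper's exact algorithm: instances whose rollouts yield (nearly) constant rewards are dropped, since their group-relative advantages are close to zero and they add rollout cost without a useful learning signal. The interfaces generate_rollouts and reward_fn are assumed.

```python
# Hypothetical sketch of a rollout-based instance filter (assumed reading of
# the abstract, not the paper's exact algorithm): keep only instances whose
# sampled rollouts show enough reward variance to yield a useful gradient.
from typing import Callable, List, Sequence
import statistics

def filter_instances(
    instances: Sequence[dict],
    generate_rollouts: Callable[[dict, int], List[str]],  # assumed interface
    reward_fn: Callable[[dict, str], float],               # assumed interface
    num_rollouts: int = 8,
    min_std: float = 1e-3,
) -> List[dict]:
    """Drop instances whose rollout rewards are (nearly) constant, since their
    group-relative advantages are ~0 and they only add noise and compute cost."""
    kept = []
    for inst in instances:
        rollouts = generate_rollouts(inst, num_rollouts)
        rewards = [reward_fn(inst, r) for r in rollouts]
        if statistics.pstdev(rewards) > min_std:
            kept.append(inst)
    return kept
```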