Large language models (LLMs) have recently demonstrated remarkable capabilities across a wide range of tasks. Typically, an LLM is pre-trained on large corpora and subsequently fine-tuned on task-specific datasets. However, during fine-tuning, LLMs may forget knowledge acquired in the pre-training stage, leading to a decline in general capabilities. To address this issue, we propose a new fine-tuning algorithm termed the Momentum-Filtered Optimizer (MoFO). The key idea of MoFO is to iteratively select and update only the model parameters with the largest momentum magnitudes. Compared to full-parameter training, MoFO achieves similar fine-tuning performance while keeping parameters closer to the pre-trained model, thereby mitigating knowledge forgetting. Unlike most existing methods for mitigating forgetting, MoFO combines two advantages. First, MoFO does not require access to pre-training data, making it particularly suitable for fine-tuning scenarios where such data is unavailable, such as fine-tuning checkpoint-only open-source LLMs. Second, MoFO does not alter the original loss function, which avoids impairing model performance on the fine-tuning tasks. We validate MoFO through rigorous convergence analysis and extensive experiments, demonstrating its superiority over existing methods in mitigating forgetting and enhancing fine-tuning performance.
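To make the key idea concrete, below is a minimal NumPy sketch of one MoFO-style Adam step. It is an illustrative assumption of the mechanism described above, not the paper's implementation: the filter here keeps the top fraction of entries by momentum magnitude over the whole parameter vector (the actual algorithm applies the filter within each parameter block), and the function name `mofo_adam_step`, the hyperparameter defaults, and `update_fraction` are all placeholders chosen for this sketch.

```python
import numpy as np

def mofo_adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                   eps=1e-8, update_fraction=0.1):
    """One MoFO-style Adam step (illustrative sketch).

    Only the `update_fraction` of entries with the largest momentum
    magnitudes are updated; all other entries are left frozen, which
    keeps the fine-tuned parameters close to the pre-trained model.
    """
    # Standard Adam moment updates with bias correction.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)

    # Momentum filter: mask in only the top-k entries by |momentum|.
    k = max(1, int(update_fraction * m.size))
    threshold = np.partition(np.abs(m).ravel(), -k)[-k]
    mask = (np.abs(m) >= threshold).astype(theta.dtype)

    # Masked Adam update: entries outside the mask are not moved at all.
    theta = theta - lr * mask * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

With `update_fraction=0.1`, 90% of the parameters remain exactly at their pre-trained values after each step, which is the mechanism that limits drift from the pre-trained model without changing the fine-tuning loss.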