Fine-tuning vision-language models (VLMs) such as CLIP often leads to catastrophic forgetting of pretrained knowledge. Prior work primarily aims to mitigate forgetting during adaptation; in practice, however, some forgetting remains inevitable. We introduce a novel paradigm, continued fine-tuning (CFT), which seeks to recover pretrained knowledge after a zero-shot model has already been adapted. We propose a simple, model-agnostic CFT strategy (named MERGETUNE) guided by linear mode connectivity (LMC), which can be applied post hoc to existing fine-tuned models without requiring architectural changes. Given a fine-tuned model, we continue fine-tuning its trainable parameters (e.g., soft prompts or linear heads) to search for a continued model that has low-loss linear paths to both the zero-shot (e.g., CLIP) and the fine-tuned (e.g., CoOp) solutions. By exploiting the geometry of the loss landscape, the continued model implicitly merges the two solutions, restoring pretrained knowledge lost in the fine-tuned counterpart. A key challenge is that the vanilla LMC constraint requires data replay from the pretraining task. We approximate this constraint for the zero-shot model via a second-order surrogate, eliminating the need for large-scale data replay. Experiments show that MERGETUNE improves the harmonic mean of CoOp by +5.6% on base-novel generalisation without adding parameters. On robust fine-tuning evaluations, the LMC-merged model from MERGETUNE surpasses ensemble baselines at lower inference cost, and achieves further gains and state-of-the-art results when ensembled with the zero-shot model. Our code is available at https://github.com/Surrey-UP-Lab/MERGETUNE.
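To make the LMC criterion concrete, the sketch below measures the loss barrier along the linear path between two parameter vectors; a (near-)zero barrier is the usual operational test for linear mode connectivity. This is an illustrative sketch only, not the paper's implementation: `max_barrier`, `interpolate`, and the toy quadratic losses are our own hypothetical names, assuming parameters are flattened into NumPy vectors and `loss_fn` evaluates the task loss at a given parameter vector.

```python
import numpy as np

def interpolate(theta_a, theta_b, alpha):
    # Linear interpolation between two flattened parameter vectors.
    return (1.0 - alpha) * theta_a + alpha * theta_b

def max_barrier(loss_fn, theta_a, theta_b, num_points=11):
    # Loss-barrier height along the linear path between theta_a and
    # theta_b: the largest amount by which the loss at an interpolated
    # point exceeds the linear baseline between the endpoint losses.
    # A barrier near zero indicates linear mode connectivity.
    loss_a, loss_b = loss_fn(theta_a), loss_fn(theta_b)
    barrier = 0.0
    for alpha in np.linspace(0.0, 1.0, num_points):
        path_loss = loss_fn(interpolate(theta_a, theta_b, alpha))
        baseline = (1.0 - alpha) * loss_a + alpha * loss_b
        barrier = max(barrier, path_loss - baseline)
    return barrier
```

For a convex toy loss the path never rises above the endpoint baseline, so the barrier is zero; a loss with a bump between the two solutions yields a positive barrier, signalling that the two solutions are not linearly connected.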