Fine-tuning vision-language models (VLMs) such as CLIP often leads to catastrophic forgetting of pretrained knowledge. Prior work primarily aims to mitigate forgetting during adaptation; however, some forgetting remains inevitable in this process. We introduce a novel paradigm, \emph{continued fine-tuning (CFT)}, which seeks to recover pretrained knowledge after a zero-shot model has already been adapted. We propose a simple, model-agnostic CFT strategy, named MERGETUNE, guided by linear mode connectivity (LMC), which can be applied post hoc to existing fine-tuned models without architectural changes. Given a fine-tuned model, we continue fine-tuning its trainable parameters (e.g., soft prompts or linear heads) to find a continued model that has low-loss linear paths to both the zero-shot (e.g., CLIP) and the fine-tuned (e.g., CoOp) solutions. By exploiting the geometry of the loss landscape, the continued model implicitly merges the two solutions, restoring pretrained knowledge lost by its fine-tuned counterpart. One challenge is that the vanilla LMC constraint requires data replay from the pretraining task. We approximate this constraint for the zero-shot model via a second-order surrogate, eliminating the need for large-scale data replay. Experiments show that MERGETUNE improves the harmonic mean of CoOp by +5.6\% on base-to-novel generalisation without adding parameters.
% We show, \emph{for the first time}, superior performance to CLIP on both DTD and EuroSAT in cross-dataset transfer.
On robust fine-tuning evaluations, the LMC-merged model from MERGETUNE surpasses ensemble baselines at lower inference cost, and achieves further gains and state-of-the-art results when ensembled with the zero-shot model. Our code is available at \href{https://github.com/Surrey-UP-Lab/MERGETUNE}{https://github.com/Surrey-UP-Lab/MERGETUNE}.