Foundation models are presented as generalists that often perform well across a wide range of tasks. Fine-tuning these models, even on limited data, provides an additional boost in task-specific performance, but often at the cost of their wider generalization, an effect termed catastrophic forgetting. In this paper, we analyze the relation between task difficulty in the CLIP model and the performance of several simple parameter-efficient fine-tuning methods through the lens of domain generalization and catastrophic forgetting. We provide evidence that the silhouette score of the zero-shot image and text embeddings is a better measure of task difficulty than the average cosine similarity of correct image/label embeddings, and we discuss observable relationships between task difficulty, fine-tuning method, domain generalization, and catastrophic forgetting. Additionally, the results averaged across tasks and performance measures demonstrate that a simplified method that trains only a subset of attention weights, which we call A-CLIP, yields a balance between domain generalization and catastrophic forgetting.
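To make the two task-difficulty measures concrete, the following is a minimal sketch rather than the procedure used in the paper. It assumes that the silhouette score is computed with cosine distance over the pooled image embeddings and per-class text embeddings, each point labeled by its class; the abstract does not specify this construction. The function names and the two-dimensional toy embeddings are hypothetical and stand in for real CLIP zero-shot embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_correct_pair_similarity(image_embs, labels, text_embs):
    """Average cosine similarity of each image with its correct label embedding."""
    sims = [cosine_similarity(img, text_embs[y]) for img, y in zip(image_embs, labels)]
    return sum(sims) / len(sims)


def silhouette_score(embs, labels):
    """Mean silhouette coefficient under cosine distance (1 - cosine similarity)."""
    n = len(embs)
    dist = [[1.0 - cosine_similarity(embs[i], embs[j]) for j in range(n)]
            for i in range(n)]
    scores = []
    for i in range(n):
        # Collect distances from point i to every other point, grouped by class.
        per_cluster = {}
        for j in range(n):
            if j != i:
                per_cluster.setdefault(labels[j], []).append(dist[i][j])
        own = per_cluster.pop(labels[i], None)
        if not own or not per_cluster:
            # Convention chosen here: singleton or single-cluster case scores 0.
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                                   # mean intra-class distance
        b = min(sum(d) / len(d) for d in per_cluster.values())    # nearest other class
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / n


def pooled_silhouette(image_embs, labels, text_embs):
    """ASSUMPTION: pool images and class text embeddings, labeled by class."""
    points = list(image_embs) + list(text_embs)
    point_labels = list(labels) + list(range(len(text_embs)))
    return silhouette_score(points, point_labels)


# Hypothetical toy data: two classes, 2-D embeddings.
labels = [0, 0, 1, 1]
text_embs = [[1.0, 0.1], [0.1, 1.0]]
easy_images = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]      # well separated
hard_images = [[0.7, 0.6], [0.55, 0.7], [0.6, 0.7], [0.7, 0.55]]    # overlapping

easy_sil = pooled_silhouette(easy_images, labels, text_embs)
hard_sil = pooled_silhouette(hard_images, labels, text_embs)
easy_cos = mean_correct_pair_similarity(easy_images, labels, text_embs)
hard_cos = mean_correct_pair_similarity(hard_images, labels, text_embs)

print(f"silhouette: easy={easy_sil:.3f} hard={hard_sil:.3f}")
print(f"mean cosine: easy={easy_cos:.3f} hard={hard_cos:.3f}")
```

In this toy setting both measures rank the well-separated task as easier. The silhouette score additionally accounts for how close each point lies to the nearest other class, which the average cosine similarity of correct pairs ignores.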