With the advent of large-scale pre-trained models, interest in adapting and exploiting them for continual learning has grown. In this paper, we propose an approach to exploiting pre-trained vision-language models (e.g., CLIP) that enables further adaptation rather than relying only on zero-shot learning of new tasks. We augment a pre-trained CLIP model with additional layers after the image encoder or before the text encoder. We investigate three strategies: a Linear Adapter and a Self-attention Adapter, both of which operate on the image embedding, and Prompt Tuning, which instead modifies the prompts fed to the CLIP text encoder. We also propose a parameter retention method for the adapter layers that uses a measure of parameter importance to better maintain stability and plasticity during incremental learning. Our experiments demonstrate that the simplest solution, a single Linear Adapter layer with parameter retention, produces the best results. Experiments on several conventional benchmarks consistently show a significant margin of improvement over the current state of the art.
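To make the Linear Adapter and parameter retention ideas more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that the adapter is a single square weight matrix applied to a frozen image embedding, that classification picks the text embedding with the highest cosine similarity, and that "parameter importance" can be approximated by the magnitude of each weight's change during a task. The abstract does not specify the importance measure, so that choice, the identity initialization, and the names `classify`, `retain`, and `keep_fraction` are all hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)


def classify(W, image_emb, text_embs):
    """Adapt a frozen image embedding with W, then return the index of
    the most similar class text embedding."""
    adapted = matvec(W, image_emb)
    scores = [cosine(adapted, t) for t in text_embs]
    return max(range(len(scores)), key=scores.__getitem__)


def retain(W_old, W_new, keep_fraction):
    """Assumed retention rule: keep the updates with the largest absolute
    change and revert every other weight to its previous value.
    Exact ties at the threshold are all kept."""
    changes = sorted(
        (abs(n - o) for ro, rn in zip(W_old, W_new) for o, n in zip(ro, rn)),
        reverse=True,
    )
    k = max(1, int(keep_fraction * len(changes)))
    threshold = changes[k - 1]
    return [
        [n if abs(n - o) >= threshold else o for o, n in zip(ro, rn)]
        for ro, rn in zip(W_old, W_new)
    ]


dim = 4
# Identity initialization (an assumption): the adapted embedding equals
# the original one, so predictions start out as plain zero-shot matching.
W_old = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]

# Random perturbation standing in for one task's training update.
rng = random.Random(0)
W_new = [[w + rng.gauss(0, 0.1) for w in row] for row in W_old]

# Keep only the quarter of the weights that changed the most.
W_kept = retain(W_old, W_new, keep_fraction=0.25)

# Toy embeddings; real ones would come from the frozen CLIP encoders.
image_emb = [1.0, 0.0, 0.0, 0.0]
text_embs = [[0.0, 1.0, 0.0, 0.0], [0.9, 0.1, 0.0, 0.0]]
predicted = classify(W_old, image_emb, text_embs)
```

In this sketch, reverting the low-change weights limits how far the adapter drifts from what it learned on earlier tasks, while the retained updates carry the new task. A gradient-based or Fisher-style importance score could be substituted for the change magnitude without altering the overall structure.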