Prompt-tuning methods for Continual Learning (CL) freeze a large pre-trained model and train only a few parameter vectors, termed prompts. Most of these methods organize the vectors in a pool of key-value pairs and use the input image as a query to retrieve the prompts (values). However, because the keys are learned as tasks progress, the prompt selection strategy is itself subject to catastrophic forgetting, an issue often overlooked by existing approaches. For instance, prompts introduced to accommodate new tasks might end up interfering with previously learned prompts. To make the selection strategy more stable, we ask a foundation model (CLIP) to select our prompts within a two-level adaptation mechanism. Specifically, the first level leverages standard textual prompts for the CLIP textual encoder, leading to stable class prototypes. The second level instead uses these prototypes, along with the query image, as keys to index a second pool. The retrieved prompts serve to adapt a pre-trained ViT, granting plasticity. In doing so, we also propose a novel residual mechanism to transfer CLIP semantics to the ViT layers. Through extensive analysis on established CL benchmarks, we show that our method significantly outperforms both state-of-the-art CL approaches and zero-shot CLIP. Notably, our findings hold even for datasets with a substantial domain gap w.r.t. the pre-training knowledge of the backbone model, as showcased by experiments on satellite imagery and medical datasets.
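As a rough illustration of the key-value retrieval step that pool-based prompt-tuning methods share, the sketch below ranks pool entries by cosine similarity between a query vector and each key, then returns the matching values. It is a toy example under stated assumptions, not the method proposed here: the function names `cosine` and `retrieve_prompts`, the 3-dimensional vectors, and the string prompt identifiers are all hypothetical, and no CLIP or ViT model is involved. In the approach described above, the keys would be tied to CLIP-derived class prototypes rather than learned freely, which this sketch does not model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_prompts(query, pool, top_k=1):
    """Return the values of the top_k pool entries whose keys best match the query.

    pool is a list of (key, value) pairs; every key has the same length as query.
    """
    ranked = sorted(pool, key=lambda kv: cosine(query, kv[0]), reverse=True)
    return [value for _, value in ranked[:top_k]]


# Toy pool: hypothetical 3-d keys paired with placeholder prompt identifiers.
pool = [
    ([1.0, 0.0, 0.0], "prompt_a"),
    ([0.0, 1.0, 0.0], "prompt_b"),
    ([0.0, 0.0, 1.0], "prompt_c"),
]

query = [0.1, 0.9, 0.2]  # stand-in for an image embedding
selected = retrieve_prompts(query, pool, top_k=2)
```

If the keys in such a pool drift while new tasks are learned, the same query can start retrieving different prompts, which is the selection instability the abstract refers to.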