Prompt-tuning methods for Continual Learning (CL) freeze a large pre-trained model and train only a few parameter vectors, termed prompts. Most of these methods organize the vectors in a pool of key-value pairs and use the input image as a query to retrieve the prompts (values). However, because the keys are learned as tasks progress, the prompt selection strategy is itself subject to catastrophic forgetting, an issue often overlooked by existing approaches. For instance, prompts introduced to accommodate new tasks might end up interfering with previously learned prompts. To make the selection strategy more stable, we ask a foundation model (CLIP) to select our prompts within a two-level adaptation mechanism. Specifically, the first level leverages standard textual prompts for the CLIP textual encoder, leading to stable class prototypes. The second level, instead, uses these prototypes along with the query image as keys to index a second pool. The retrieved prompts serve to adapt a pre-trained ViT, granting plasticity. In doing so, we also propose a novel residual mechanism to transfer CLIP semantics to the ViT layers. Through extensive analysis on established CL benchmarks, we show that our method significantly outperforms both state-of-the-art CL approaches and the zero-shot CLIP test. Notably, our findings hold true even for datasets with a substantial domain gap w.r.t. the pre-training knowledge of the backbone model, as showcased by experiments on satellite imagery and medical datasets.
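The key-value retrieval step described above, and the way it can break when keys change, can be illustrated with a toy sketch. This is not the method proposed here: the function names (`cosine`, `retrieve_prompts`), the two-dimensional vectors, and the choice of cosine similarity are all assumptions made purely for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_prompts(query, keys, prompts, top_k=1):
    """Return the top_k prompts whose keys are most similar to the query."""
    ranked = sorted(range(len(keys)),
                    key=lambda i: cosine(query, keys[i]),
                    reverse=True)
    return [prompts[i] for i in ranked[:top_k]]

# Hypothetical pool: three keys, each paired with a named prompt.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
prompts = ["prompt_a", "prompt_b", "prompt_c"]

# Stand-in for the feature of an input image from an earlier task.
query = [0.9, 0.1]
selected = retrieve_prompts(query, keys, prompts)

# Assumed scenario: training on a later task moves the first key.
# The same query now retrieves a different prompt.
drifted_keys = [[0.0, -1.0], [0.0, 1.0], [-1.0, 0.0]]
selected_after_drift = retrieve_prompts(query, drifted_keys, prompts)

print(selected, selected_after_drift)
```

In this toy setting the same query retrieves `prompt_a` before the key moves and `prompt_b` afterwards, which is the kind of selection instability that motivates deriving keys from a frozen model.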