With the emergence of Transformers and Vision-Language Models (VLMs) such as CLIP, large pre-trained models have become a common strategy to enhance performance in Continual Learning scenarios. This has led to the development of numerous prompting strategies that effectively fine-tune transformer-based models without succumbing to catastrophic forgetting. However, these methods struggle to specialize the model on domains that deviate significantly from the pre-training data while preserving its zero-shot capabilities. In this work, we propose Continual Generative training for Incremental prompt-Learning, a novel approach that mitigates forgetting while adapting a VLM by exploiting generative replay to align prompts to tasks. We also introduce a new metric to evaluate zero-shot capabilities within CL benchmarks. Through extensive experiments on different domains, we demonstrate the effectiveness of our framework in adapting to new tasks while improving zero-shot capabilities. Further analysis reveals that our approach can bridge the gap with joint prompt tuning. The codebase is available at https://github.com/aimagelab/mammoth.