Continual learning aims to enable a model to incrementally learn knowledge from sequentially arriving data. Previous works adopt the conventional classification architecture, which consists of a feature extractor and a classifier. The feature extractor is shared across sequentially arriving tasks or classes, but the classifier must be expanded with a specific group of weights for each new class. Consequently, the parameters of a continual learner gradually increase. Moreover, as the classifier covers all previously seen classes, a memory of a certain size is usually required to store rehearsal data to mitigate classifier bias and catastrophic forgetting. In this paper, we propose a non-incremental learner, named AttriCLIP, to incrementally extract knowledge of new classes or tasks. Specifically, AttriCLIP is built upon the pre-trained visual-language model CLIP. Its image encoder and text encoder are kept fixed and extract features from images and text, respectively. The text input consists of a category name and a fixed number of learnable parameters, which are selected from our designed attribute word bank and serve as attributes. Because classification is performed by computing the similarity between visual and textual features, no classifier weights need to be added for new classes, and AttriCLIP is therefore a non-incremental learner. The attribute prompts, which encode common knowledge useful for classification, effectively mitigate catastrophic forgetting and avoid the need to construct a replay memory. We evaluate AttriCLIP and compare it with CLIP-based and previous state-of-the-art continual learning methods in realistic settings with domain shift and long-sequence learning. The results show that our method performs favorably against previous state-of-the-art methods. The implementation code is available at https://github.com/bhrqw/AttriCLIP.
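The similarity-based classification described in the abstract can be illustrated with a minimal sketch. It is not the released AttriCLIP implementation: the frozen CLIP encoders are replaced by random vectors and mean pooling, and the key-based top-k selection rule, the names `bank_keys`, `bank_prompts`, `select_attributes`, `encode_text`, and `classify`, and the sizes `DIM`, `BANK_SIZE`, and `TOP_K` are all assumptions made for illustration. The sketch only shows why adding a class requires a new class-name embedding rather than new classifier weights.

```python
import math
import random

random.seed(0)
DIM, BANK_SIZE, TOP_K = 16, 10, 3  # hypothetical sizes, chosen for illustration


def rand_vec():
    """Random stand-in for a feature or embedding vector."""
    return [random.gauss(0.0, 1.0) for _ in range(DIM)]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical attribute bank of fixed size: each entry pairs a key
# (used for selection) with a prompt vector (the learnable attribute).
bank_keys = [rand_vec() for _ in range(BANK_SIZE)]
bank_prompts = [rand_vec() for _ in range(BANK_SIZE)]


def select_attributes(image_feat, k=TOP_K):
    """Assumed selection rule: indices of the k keys most similar to the image."""
    ranked = sorted(range(BANK_SIZE),
                    key=lambda i: cosine(bank_keys[i], image_feat),
                    reverse=True)
    return ranked[:k]


def encode_text(prompts, class_embedding):
    """Stand-in for a frozen text encoder: mean-pool prompts and class name."""
    tokens = prompts + [class_embedding]
    return [sum(column) / len(tokens) for column in zip(*tokens)]


def classify(image_feat, class_embeddings):
    """Score each class by image-text cosine similarity; return (argmax, scores)."""
    prompts = [bank_prompts[i] for i in select_attributes(image_feat)]
    scores = [cosine(encode_text(prompts, c), image_feat)
              for c in class_embeddings]
    return scores.index(max(scores)), scores
```

In this sketch, extending the class list from five to seven entries changes only the length of `class_embeddings`; the bank stays at `BANK_SIZE` entries and the scores of the original classes are unchanged for the same image feature.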