Learning Prompt with Distribution-Based Feature Replay for Few-Shot Class-Incremental Learning

Few-shot Class-Incremental Learning (FSCIL) aims to continually learn new classes from very limited training data without forgetting previously encountered ones. Existing studies rely solely on pure visual networks, whereas in this paper we address FSCIL by leveraging a vision-language model (e.g., CLIP) and propose a simple yet effective framework, named Learning Prompt with Distribution-based Feature Replay (LP-DiF). We observe that zero-shot evaluation with CLIP alone can substantially outperform the most influential existing methods. We then apply prompt tuning to further improve its adaptation ability, allowing the model to continually capture session-specific knowledge. To prevent the learnable prompt from forgetting old knowledge in a new session, we propose a pseudo-feature replay approach. Specifically, we preserve the old knowledge of each class by maintaining a feature-level Gaussian distribution with a diagonal covariance matrix, estimated from the image features of training images and from synthetic features generated by a VAE. When progressing to a new session, pseudo-features sampled from the old-class distributions are combined with the training images of the current session to optimize the prompt, enabling the model to learn new knowledge while retaining old knowledge. Experiments on three prevalent benchmarks (CIFAR100, mini-ImageNet, and CUB-200) and two more challenging benchmarks proposed in this paper (SUN-397 and CUB-200$^*$) showcase the superiority of LP-DiF, which achieves new state-of-the-art (SOTA) results in FSCIL. Code is publicly available at https://github.com/1170300714/LP-DiF.
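
To make the feature-level replay idea concrete, the following is a minimal sketch of fitting a diagonal-covariance Gaussian to the features of one class and sampling pseudo-features from it. It is not the authors' implementation: the function names `fit_diag_gaussian` and `sample_pseudo_features`, the feature dimension, the sample counts, and the `eps` variance floor are all assumptions made for illustration, and random arrays stand in for real image features and VAE-synthesized features.

```python
import numpy as np


def fit_diag_gaussian(features, eps=1e-6):
    """Estimate a per-dimension mean and variance from an (n, d) feature array.

    Hypothetical helper: a diagonal covariance is stored as a length-d
    variance vector. `eps` is an assumed floor that keeps variances positive.
    """
    features = np.asarray(features, dtype=np.float64)
    mean = features.mean(axis=0)
    var = features.var(axis=0) + eps
    return mean, var


def sample_pseudo_features(mean, var, n_samples, rng):
    """Draw n_samples pseudo-features from N(mean, diag(var))."""
    noise = rng.standard_normal((n_samples, mean.shape[0]))
    return mean + np.sqrt(var) * noise


rng = np.random.default_rng(0)

# Stand-ins (assumptions) for one old class: a few real image features
# plus a larger set of synthesized features, both of dimension 8.
real_feats = rng.normal(loc=1.0, scale=0.5, size=(5, 8))
synth_feats = rng.normal(loc=1.0, scale=0.5, size=(20, 8))

# Estimate the class distribution from real and synthesized features together.
mean, var = fit_diag_gaussian(np.concatenate([real_feats, synth_feats]))

# In a later session, sample pseudo-features for this old class to replay
# alongside the features of the new session's training images.
pseudo = sample_pseudo_features(mean, var, n_samples=16, rng=rng)
```

Storing only a mean and a variance vector per class keeps the memory cost linear in the feature dimension, which is one plausible reason to prefer a diagonal covariance over a full one when only a handful of samples per class are available.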
