Class-Incremental Learning (CIL), or continual learning, is a desirable real-world capability that requires a learning system to adapt to new tasks without forgetting former ones. While traditional CIL methods rely on visual information to grasp core features, recent advances in Vision-Language Models (VLMs) have shown promising capabilities in learning generalizable representations with the aid of textual information. However, when continually trained on new classes, VLMs often suffer from catastrophic forgetting of former knowledge. Applying VLMs to CIL poses two major challenges: 1) how to adapt the model without forgetting; and 2) how to make full use of the multi-modal information. To this end, we propose PROjectiOn Fusion (PROOF), which enables VLMs to learn without forgetting. To handle the first challenge, we propose training task-specific projections on top of the frozen image/text encoders. When a new task arrives, new projections are added and former projections are frozen, alleviating the forgetting of old concepts. For the second challenge, we propose a fusion module to better utilize the cross-modal information. By jointly adjusting visual and textual features, the model can capture semantic information with stronger representation ability. Extensive experiments on nine benchmark datasets validate that PROOF achieves state-of-the-art performance.
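To make the expandable-projection idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes that features from the frozen encoder arrive as plain vectors, that each task contributes one linear projection, and that the projection outputs are aggregated by summation; the class name `ExpandableProjection`, its methods, and the summation rule are all hypothetical choices made for illustration.

```python
import random


class ExpandableProjection:
    """Hypothetical sketch: one linear projection per task over frozen features."""

    def __init__(self, dim, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.projections = []  # one dim x dim weight matrix per task
        self.trainable = []    # parallel flags; only the newest entry is True

    def add_task(self):
        # Freeze every earlier projection, then append a new trainable one.
        self.trainable = [False] * len(self.projections)
        weight = [[self.rng.gauss(0.0, 0.02) for _ in range(self.dim)]
                  for _ in range(self.dim)]
        self.projections.append(weight)
        self.trainable.append(True)

    def forward(self, feature):
        # Assumption: aggregate by summing the outputs of all projections.
        out = [0.0] * self.dim
        for weight in self.projections:
            for i, row in enumerate(weight):
                out[i] += sum(w * x for w, x in zip(row, feature))
        return out

    def sgd_step(self, grads, lr=0.1):
        # Update only trainable projections; frozen ones are left untouched.
        for weight, grad, flag in zip(self.projections, grads, self.trainable):
            if not flag:
                continue
            for i in range(self.dim):
                for j in range(self.dim):
                    weight[i][j] -= lr * grad[i][j]


# Usage: two tasks arrive in sequence; only the second projection is updated.
model = ExpandableProjection(dim=4)
model.add_task()
model.add_task()
old_snapshot = [row[:] for row in model.projections[0]]
new_snapshot = [row[:] for row in model.projections[1]]
dummy_grads = [[[1.0] * 4 for _ in range(4)] for _ in model.projections]
model.sgd_step(dummy_grads)
output = model.forward([1.0, 0.0, 0.0, 0.0])
```

Because the first projection is never updated after the second task is added, whatever it learned for the first task is preserved by construction. The cost is that the number of projections grows linearly with the number of tasks.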