Multimodal incremental learning must digest information from multiple modalities while continually learning new knowledge without forgetting previously learned information. This task poses several challenges, chiefly the larger storage footprint of multimodal data in exemplar-based methods and the computational cost of fine-tuning large multimodal models. In this paper, we leverage a parameter-efficient tuning scheme to reduce the burden of fine-tuning and propose an exemplar masking framework to efficiently replay old knowledge. Specifically, non-important tokens are masked based on attention weights and the correlation across modalities, significantly reducing the storage size of each exemplar and consequently allowing more exemplars to be saved under the same memory buffer. Moreover, we design a multimodal data augmentation technique to diversify the exemplars used for replaying prior knowledge. In experiments, we not only evaluate our method on existing multimodal datasets but also extend the ImageNet-R dataset to a multimodal dataset as a real-world application, where captions are generated by querying multimodal large language models (e.g., InstructBLIP). Extensive experiments show that our exemplar masking framework is more efficient and more robust to catastrophic forgetting under the same limited memory buffer. Code is available at https://github.com/YiLunLee/Exemplar_Masking_MCIL.
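The core storage-saving idea, keeping only the most-attended tokens of each exemplar, can be sketched as follows. This is a minimal, hypothetical illustration (not the paper's implementation): it assumes per-token importance scores (e.g., attention weights) are already available, and the function name and `keep_ratio` parameter are illustrative.

```python
import numpy as np

def mask_exemplar_tokens(tokens, attn_weights, keep_ratio=0.25):
    """Hypothetical sketch: retain only the most important tokens of an exemplar.

    tokens: (N, D) array of token embeddings
    attn_weights: (N,) importance scores, e.g. attention from a summary token
    keep_ratio: fraction of tokens stored in the memory buffer
    """
    n_keep = max(1, int(len(tokens) * keep_ratio))
    # Indices of the n_keep highest-scoring (most important) tokens.
    keep_idx = np.argsort(attn_weights)[-n_keep:]
    keep_idx.sort()  # preserve the original token order
    return tokens[keep_idx], keep_idx

# Toy usage: 8 tokens of dimension 4; keeping 25% stores only 2 tokens,
# so 4x more exemplars fit in the same memory buffer.
tokens = np.arange(32, dtype=float).reshape(8, 4)
attn = np.array([0.05, 0.3, 0.02, 0.1, 0.25, 0.08, 0.15, 0.05])
kept, idx = mask_exemplar_tokens(tokens, attn, keep_ratio=0.25)
print(kept.shape)  # (2, 4)
```

In a multimodal setting, the scores would additionally reflect cross-modal correlation (e.g., how strongly an image patch is attended by the caption tokens), so that tokens uninformative to either modality are dropped first.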