In this paper, we introduce Modality-Inconsistent Continual Learning (MICL), a new continual learning scenario for Multimodal Large Language Models (MLLMs) that involves tasks with inconsistent modalities (image, audio, or video) and varying task types (captioning or question-answering). Unlike existing vision-only or modality-incremental settings, MICL combines modality and task type shifts, both of which drive catastrophic forgetting. To address these challenges, we propose MoInCL, which employs a Pseudo Targets Generation Module to mitigate forgetting caused by task type shifts in previously seen modalities. It also incorporates Instruction-based Knowledge Distillation to preserve the model's ability to handle previously learned modalities when new ones are introduced. We benchmark MICL on six tasks and conduct experiments to validate the effectiveness of the proposed MoInCL. The experimental results highlight the superiority of MoInCL, showing significant improvements over representative and state-of-the-art continual learning baselines.