Visual instruction tuning (VIT) enables multimodal large language models (MLLMs) to effectively handle a wide range of vision tasks by framing them as language-based instructions. Building on this, continual visual instruction tuning (CVIT) extends the capability of MLLMs to incrementally learn new tasks, accommodating evolving functionalities. While prior work has advanced CVIT through the development of new benchmarks and approaches to mitigate catastrophic forgetting, these efforts largely follow traditional continual learning paradigms, neglecting the unique challenges specific to CVIT. We identify a dual form of catastrophic forgetting in CVIT, where MLLMs not only forget previously learned visual understanding but also experience a decline in instruction following abilities as they acquire new tasks. To address this, we introduce the Separable Mixture of Low-Rank Adaptation (SMoLoRA) framework, which employs separable routing through two distinct modules: one for visual understanding and one for instruction following. This dual-routing design enables specialized adaptation in both domains, preventing forgetting while improving performance. Furthermore, we propose a novel CVIT benchmark that goes beyond existing benchmarks by additionally evaluating a model's ability to generalize to unseen tasks and handle diverse instructions across various tasks. Extensive experiments demonstrate that SMoLoRA outperforms existing methods in mitigating dual forgetting, improving generalization to unseen tasks, and ensuring robustness in following diverse instructions.
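To make the dual-routing idea concrete, the following is a minimal numpy sketch of a separable mixture of low-rank adapters: a frozen base weight plus two independently routed pools of LoRA experts, one pool per module. All names, the number of experts, the softmax routing, and the additive combination are illustrative assumptions; the abstract does not specify SMoLoRA's exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_experts = 16, 4, 3  # toy sizes: hidden dim, LoRA rank, experts per module

# Frozen base weight and two pools of low-rank (A, B) adapter pairs
# (hypothetical layout; one pool per module, as in the dual-routing description).
W = rng.standard_normal((d, d))
vis_A = rng.standard_normal((n_experts, d, r)) * 0.01
vis_B = rng.standard_normal((n_experts, r, d)) * 0.01
ins_A = rng.standard_normal((n_experts, d, r)) * 0.01
ins_B = rng.standard_normal((n_experts, r, d)) * 0.01
vis_router = rng.standard_normal((d, n_experts))  # visual-understanding router
ins_router = rng.standard_normal((d, n_experts))  # instruction-following router

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def smolora_forward(x):
    """Separable routing: each module weights its own LoRA experts independently."""
    base = x @ W
    w_vis = softmax(x @ vis_router)  # (batch, n_experts) mixture weights
    w_ins = softmax(x @ ins_router)
    vis_delta = sum(w_vis[:, [i]] * (x @ vis_A[i] @ vis_B[i]) for i in range(n_experts))
    ins_delta = sum(w_ins[:, [i]] * (x @ ins_A[i] @ ins_B[i]) for i in range(n_experts))
    return base + vis_delta + ins_delta

x = rng.standard_normal((2, d))
y = smolora_forward(x)
print(y.shape)  # (2, 16)
```

Because each pool has its own router, updates for a new task can specialize the instruction-following experts without overwriting the visual-understanding experts (and vice versa), which is the mechanism the abstract credits with mitigating dual forgetting.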