Multimodal Large Language Models (MLLMs) unify heterogeneous vision-language tasks under a shared generative framework via instruction tuning, yet real-world deployment demands continual capability expansion, making Multimodal Continual Instruction Tuning (MCIT) essential. Existing methods either train all tasks on a shared parameter set or allocate a dedicated module to each new task. Shared updates force heterogeneous tasks to compete for the same parameters, causing previously learned capabilities to be forgotten. Conversely, isolated expansion prevents interference but severely limits parameter efficiency over long task streams. To address this dilemma, we propose CRAM. By isolating task-specific patterns in independent modules, CRAM mitigates catastrophic forgetting across tasks. To improve parameter efficiency, adaptive-rank instantiation identifies the gap between what existing experts already provide and what a new task demands, and dynamically allocates only the parameters needed to close it. To ensure stable reuse across tasks, centroid-guided routing recognizes and activates the capabilities of existing experts, while an orthogonality penalty confines new updates to task-specific directions, preventing them from re-learning general capabilities. Extensive experiments across diverse benchmarks consistently demonstrate CRAM's superiority over existing methods.
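The abstract does not give formulas for the routing or the penalty, so the following is only an illustrative sketch of the two general ideas, not CRAM's actual implementation. It assumes that routing picks the expert whose centroid is most cosine-similar to an input feature, and that the orthogonality penalty is the squared Frobenius norm of the overlap between an existing update basis and a new one. The names `route_to_expert` and `orthogonality_penalty`, and all toy values, are hypothetical.

```python
import numpy as np


def route_to_expert(feature, centroids):
    """Return the index of the centroid most cosine-similar to `feature`.

    Assumption: cosine similarity is the routing score; the source does not
    specify the metric.
    """
    f = feature / np.linalg.norm(feature)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return int(np.argmax(c @ f))


def orthogonality_penalty(old_basis, new_basis):
    """Squared Frobenius norm of the overlap between two sets of row vectors.

    The value is zero exactly when every row of `new_basis` is orthogonal to
    every row of `old_basis`. Assumption: this is one common form of such a
    penalty, not necessarily the one used by CRAM.
    """
    return float(np.sum((old_basis @ new_basis.T) ** 2))


# Toy example: two expert centroids in a 3-dimensional feature space.
centroids = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0]])
feature = np.array([0.1, 0.9, 0.0])
chosen = route_to_expert(feature, centroids)

# Toy example: an existing update direction and two candidate new directions.
old_basis = np.array([[1.0, 0.0, 0.0]])
orthogonal_update = np.array([[0.0, 1.0, 0.0]])
overlapping_update = np.array([[1.0, 1.0, 0.0]])
```

In this toy setting the feature is routed to the second centroid, the orthogonal candidate incurs no penalty, and the overlapping candidate is penalized in proportion to its component along the existing direction.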