Multimodal Large Language Models (MLLMs) achieve strong performance through instruction tuning, but real-world deployment requires them to continually acquire new vision-language capabilities, which makes Multimodal Continual Instruction Tuning (MCIT) essential. To reduce inter-task interference and promote collaboration, recent methods often employ sparse architectures such as Mixture of LoRA Experts with image-text similarity routing. However, tasks with distinct response structures can share highly similar visual-linguistic semantics and thus be wrongly routed to the same expert, so image-text similarity alone is insufficient for reliable task assignment. For example, an expert trained on a grounding task that requires coordinate prediction may become biased toward producing short textual answers after learning semantically similar VQA tasks. Such format-blind task assignment merges heterogeneous response types into shared parameters, inducing gradient interference and ineffective expert collaboration. To address this problem, we propose ProtoAda, a prototype-guided adaptive tuning framework. ProtoAda introduces format-aware task prototypes that align task assignment and routing with both task semantics and output structure, and it consolidates format-compatible updates in a geometry-aware manner so that existing parameters are reused and progressively refined. Extensive experiments on multiple benchmarks demonstrate that ProtoAda achieves superior performance, especially on tasks whose answer structures are easily corrupted by sequential tuning.
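The routing failure described above can be illustrated with a toy sketch. This is not the ProtoAda implementation: the prototype vectors, the expert names, the `route` helper, and the additive `format_weight` scoring are all hypothetical, chosen only to show how a query that is semantically close to a VQA task but expects coordinate output is assigned differently once output format enters the score.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Hypothetical task prototypes: (semantic embedding, output-format embedding).
# The two tasks are semantically close but differ in response structure.
PROTOTYPES = {
    "vqa_expert": ([0.9, 0.1, 0.4], [1.0, 0.0]),        # short textual answers
    "grounding_expert": ([0.8, 0.2, 0.5], [0.0, 1.0]),  # coordinate outputs
}


def route(semantic, fmt=None, format_weight=1.0):
    """Pick the expert whose prototype scores highest.

    With fmt=None the score is semantic similarity only (format-blind).
    Otherwise a weighted format similarity is added (an assumed scoring rule).
    """
    scores = {}
    for name, (proto_sem, proto_fmt) in PROTOTYPES.items():
        score = cosine(semantic, proto_sem)
        if fmt is not None:
            score += format_weight * cosine(fmt, proto_fmt)
        scores[name] = score
    return max(scores, key=scores.get)


# A grounding-style query whose semantics happen to match the VQA prototype.
query_semantic = [0.9, 0.1, 0.4]
query_format = [0.0, 1.0]  # the expected answer is a set of coordinates

semantic_only = route(query_semantic)
format_aware = route(query_semantic, query_format)
print(semantic_only, format_aware)
```

Under these made-up vectors, similarity-only routing sends the query to the VQA expert, while the format-aware score sends it to the grounding expert. How ProtoAda actually builds its prototypes and combines the two signals is not specified in the abstract.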