Multimodal Continual Instruction Tuning (MCIT) enables Multimodal Large Language Models (MLLMs) to meet continuously emerging requirements without expensive retraining. MCIT faces two major obstacles: catastrophic forgetting (where old knowledge is forgotten) and negative forward transfer (where performance on future tasks is degraded). Although existing methods have greatly alleviated catastrophic forgetting, they still suffer from negative forward transfer. By performing singular value decomposition (SVD) on input embeddings, we discover a large discrepancy across different input embeddings. This discrepancy causes the model to learn information that is irrelevant to old and pre-trained tasks, which leads to both catastrophic forgetting and negative forward transfer. To address these issues, we propose Fwd-Prompt, a prompt-based method that projects the prompt gradient onto the residual space to minimize interference between tasks, and onto the pre-trained subspace to reuse pre-trained knowledge. Our experiments demonstrate that Fwd-Prompt achieves state-of-the-art performance while updating fewer parameters and requiring no old samples. Our research sheds light on the potential of continuously adapting MLLMs to new tasks under the instruction tuning paradigm and encourages future studies to explore MCIT. The code will soon be publicly available.
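The two projections named in the abstract can be illustrated with a minimal NumPy sketch. This is not the Fwd-Prompt implementation: the function names, the `energy` threshold used to pick the subspace rank, and the synthetic embeddings are all assumptions made for illustration, and the abstract does not say how the two projected gradients are combined, so they are shown separately.

```python
import numpy as np


def top_subspace(embeddings, energy=0.95):
    """Return an orthonormal basis (d x k) of the leading right-singular
    directions that hold `energy` of the squared singular values.
    The energy-based rank choice is an assumption, not taken from the paper."""
    _, s, vt = np.linalg.svd(embeddings, full_matrices=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(ratio, energy) + 1)
    return vt[:k].T


def project_onto(grad, basis):
    """Component of `grad` that lies inside the span of `basis`."""
    return basis @ (basis.T @ grad)


def project_to_residual(grad, basis):
    """Component of `grad` orthogonal to the span of `basis`."""
    return grad - project_onto(grad, basis)


rng = np.random.default_rng(0)
d = 16

# Hypothetical old-task embeddings that occupy a 4-dimensional subspace.
old_embeddings = rng.normal(size=(200, 4)) @ rng.normal(size=(4, d))
# Hypothetical embeddings standing in for the pre-trained subspace.
pretrained_embeddings = rng.normal(size=(200, 3)) @ rng.normal(size=(3, d))

old_basis = top_subspace(old_embeddings, energy=0.99)
pretrained_basis = top_subspace(pretrained_embeddings, energy=0.99)

# Stand-in for a prompt gradient (a single flattened vector here).
grad = rng.normal(size=d)

# Update direction that avoids the directions used by old-task inputs.
residual_grad = project_to_residual(grad, old_basis)
# Update direction restricted to the pre-trained subspace.
reuse_grad = project_onto(grad, pretrained_basis)
```

In this sketch, `residual_grad` has no component along the directions spanned by the old-task embeddings, which is the sense in which a residual-space update limits interference, while `reuse_grad` keeps only the part of the gradient that lies in the directions the pre-trained inputs already occupy.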