Instruction tuning in multimodal large language models (MLLMs) aims to seamlessly integrate a backbone LLM with a pre-trained feature encoder for downstream tasks. The major challenge is efficiently finding synergy through cooperative learning, in which the LLM adapts its reasoning abilities to downstream tasks while the feature encoder adjusts its encoding to provide more relevant modal information. In this paper, we analyze MLLM instruction tuning from both theoretical and empirical perspectives, and find that unbalanced learning between the two components (i.e., the feature encoder and the LLM) can cause diminishing gradients that slow model convergence and often lead to sub-optimal results due to insufficient learning. Inspired by these findings, we propose a metric to quantitatively evaluate the learning balance, based on which we further design a dynamic learning scheduler that better coordinates the learning of the two components. In addition, we introduce an auxiliary loss regularization method that promotes updates to the MLLM's generation distribution according to the learning state of each component, which helps prevent gradient diminishing in either component and enables a more accurate estimation of the learning balance coefficient. We conduct experiments with multiple LLM backbones and feature encoders; our techniques are model-agnostic and can be generically integrated with various MLLM backbones. Experimental results on multiple downstream tasks across vision and audio modalities demonstrate the proposed method's superior efficiency and effectiveness in MLLM instruction tuning.
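The coordination idea above can be illustrated with a minimal sketch. The code below is a hypothetical example, not the paper's actual formulation: it assumes the learning balance coefficient is the ratio of the feature encoder's gradient norm to the LLM's, and the scheduler rescales each component's learning rate to damp whichever side currently dominates. The function names, the square-root scaling, and the clipping range are illustrative assumptions.

```python
import math


def balance_coefficient(enc_grad_norm, llm_grad_norm, eps=1e-8):
    """Hypothetical balance measure: ratio of the feature encoder's
    gradient norm to the LLM's. A value of 1.0 means the two components
    are learning at comparable rates; eps guards against division by zero."""
    return (enc_grad_norm + eps) / (llm_grad_norm + eps)


def balanced_lrs(base_lr, coeff, clip=10.0):
    """Illustrative dynamic scheduler: damp the component whose gradients
    dominate and boost the other, so neither side's learning signal
    diminishes. Clipping keeps the adjustment within a stable range."""
    c = min(max(coeff, 1.0 / clip), clip)
    return {
        "encoder": base_lr / math.sqrt(c),  # dampened when coeff > 1
        "llm": base_lr * math.sqrt(c),      # boosted when coeff > 1
    }
```

For example, if the encoder's gradient norm is four times the LLM's, the sketch halves the encoder's learning rate and doubles the LLM's, nudging the two components back toward balanced cooperative learning.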