Despite the remarkable performance of multimodal large language models (MLLMs) across diverse tasks, the substantial training and inference costs impede their advancement. The majority of computation stems from the overwhelming volume of vision tokens processed by the transformer decoder. In this paper, we propose to build efficient MLLMs by leveraging the Mixture-of-Depths (MoD) mechanism, where each transformer decoder layer selects essential vision tokens to process while skipping redundant ones. However, integrating MoD into MLLMs is non-trivial. To address the challenges of training and inference stability as well as limited training data, we adapt the MoD module with two novel designs: tanh-gated weight normalization (TanhNorm) and symmetric token reweighting (STRing). Moreover, we observe that vision tokens exhibit higher redundancy in deeper layers, and thus design a progressive ratio decay (PRD) strategy, which gradually reduces the token retention ratio layer by layer, following a shifted cosine schedule. This crucial design fully unleashes the potential of MoD, significantly boosting the efficiency and performance of our models. To validate the effectiveness of our approach, we conduct extensive experiments with two baseline models across 14 benchmarks. Our model, p-MoD, matches or even surpasses the performance of the baseline models, while requiring only 55.6% of the TFLOPs and 53.8% of the KV cache storage during inference, and 77.7% of the GPU hours during training.
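To make the PRD strategy concrete, the sketch below shows one plausible form of a shifted cosine decay for the per-layer vision-token retention ratio. The exact formula and the `shift` floor value are assumptions for illustration, not the paper's definitive parameterization; the only properties taken from the abstract are that the ratio decreases monotonically with depth and follows a cosine-shaped, shifted schedule.

```python
import math

def retention_ratio(layer_idx: int, num_layers: int, shift: float = 0.1) -> float:
    """Hypothetical shifted cosine schedule for PRD.

    The vision-token retention ratio decays from 1.0 at the first decoder
    layer toward the floor `shift` at the last layer, following half a
    cosine period. `shift` is an assumed hyperparameter, not a value from
    the paper.
    """
    progress = layer_idx / max(num_layers - 1, 1)        # 0.0 -> 1.0 across layers
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # 1.0 -> 0.0, cosine-shaped
    return shift + (1.0 - shift) * cosine                # 1.0 -> shift

# Example: a 28-layer decoder keeps all vision tokens early on and
# progressively drops them, retaining only ~10% at the deepest layer.
ratios = [retention_ratio(i, 28) for i in range(28)]
```

Multiplying these per-layer ratios against the initial vision-token count gives the expected compute saving: shallow layers, where tokens are least redundant, still see most tokens, while deep layers process only a small retained subset.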