Multimodal Large Language Models (MLLMs) suffer from severe training inefficiency, stemming from their massive model sizes and large numbers of visual tokens. Existing efforts toward efficient training focus on reducing model sizes or trainable parameters. Inspired by the success of Visual Token Pruning (VTP) in improving inference efficiency, we explore a complementary direction for efficient training: reducing visual tokens. However, applying VTP at the training stage creates a training-inference mismatch: pruning-trained models perform poorly when inferring on non-pruned, full visual token sequences. To close this gap, we propose DualSpeed, a fast-slow framework for efficient training of MLLMs. The fast mode is the primary mode; it incorporates existing VTP methods as plugins to reduce visual tokens, along with a mode isolator that separates the model's behaviors across the two modes. The slow mode is the auxiliary mode, in which the model is trained on full visual sequences to preserve training-inference consistency. To boost slow-mode training, it further leverages self-distillation to learn from the sufficiently trained fast mode. Together, DualSpeed achieves both training efficiency and non-degraded performance. Experiments show DualSpeed accelerates the training of LLaVA-1.5 by 2.1$\times$ and LLaVA-NeXT by 4.0$\times$ while retaining over 99% of performance. Code: https://github.com/dingkun-zhang/DualSpeed