We present the Modality Integration Rate (MIR), an effective, robust, and generalizable metric for indicating the multi-modal pre-training quality of Large Vision Language Models (LVLMs). Large-scale pre-training plays a critical role in building capable LVLMs, yet evaluating its training quality without the costly supervised fine-tuning stage remains under-explored. Loss, perplexity, and in-context evaluation results are commonly used pre-training metrics for Large Language Models (LLMs), but we observe that these metrics are less indicative when aligning a well-trained LLM with a new modality. Due to the lack of proper metrics, research on LVLMs in the critical pre-training stage is greatly hindered, including choices of training data, efficient module design, etc. In this paper, we propose evaluating pre-training quality from the perspective of inter-modal distribution distance and present MIR, the Modality Integration Rate, which is 1) \textbf{Effective}, representing the pre-training quality and showing a positive correlation with benchmark performance after supervised fine-tuning; 2) \textbf{Robust} to different training/evaluation data; and 3) \textbf{Generalizable} across training configurations and architecture choices. We conduct a series of pre-training experiments to explore the effectiveness of MIR and observe satisfactory results: MIR is indicative of training data selection, training strategy scheduling, and model architecture design for achieving better pre-training results. We hope MIR can be a helpful metric for building capable LVLMs and inspire subsequent research on modality alignment in different areas. Our code is at: https://github.com/shikiw/Modality-Integration-Rate.
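To make the "inter-modal distribution distance" idea concrete, here is a minimal conceptual sketch that compares the distributions of vision-token and text-token features with a diagonal-Gaussian 2-Wasserstein distance. This is an illustrative assumption for exposition only, not the paper's actual MIR formulation (the function name `modality_gap` and the choice of distance are hypothetical):

```python
import numpy as np

def modality_gap(vision_feats: np.ndarray, text_feats: np.ndarray) -> float:
    """Squared 2-Wasserstein distance between diagonal-Gaussian fits of two
    feature sets of shape (num_tokens, hidden_dim).

    Illustrative stand-in for an inter-modal distribution distance; the
    paper's MIR uses its own formulation.
    """
    # Fit a diagonal Gaussian to each modality's features.
    mu_v, mu_t = vision_feats.mean(axis=0), text_feats.mean(axis=0)
    var_v, var_t = vision_feats.var(axis=0), text_feats.var(axis=0)
    # Closed form for diagonal Gaussians: mean shift plus std-dev mismatch.
    mean_term = float(np.sum((mu_v - mu_t) ** 2))
    var_term = float(np.sum((np.sqrt(var_v) - np.sqrt(var_t)) ** 2))
    return mean_term + var_term

# Example: well-aligned modalities yield a smaller gap than misaligned ones.
rng = np.random.default_rng(0)
text = rng.normal(0.0, 1.0, size=(512, 64))
aligned_vision = rng.normal(0.0, 1.0, size=(512, 64))
misaligned_vision = rng.normal(3.0, 1.0, size=(512, 64))
print(modality_gap(aligned_vision, text) < modality_gap(misaligned_vision, text))
```

Under this sketch, a shrinking gap over pre-training would signal that visual features are being absorbed into the LLM's text feature distribution, which is the intuition the abstract describes.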