Vision-Language Models (VLMs) are typically trained on a diverse set of multi-modal domains, yet current practice relies on costly manual tuning of the data mixture. We propose MaD-Mix, a principled and computationally efficient framework that derives multi-modal data mixtures for VLM training. MaD-Mix formulates data mixing as modality-aware domain alignment maximization and obtains closed-form multi-modal alignment scores from the Fenchel dual through inter-modal coupling variables. MaD-Mix systematically handles domains with missing modalities, allowing the integration of language-only domains. Empirical evaluations on 0.5B and 7B models demonstrate that MaD-Mix accelerates VLM training across diverse benchmarks: in image-text instruction tuning, it matches the performance of human-tuned data mixtures while using 22% fewer training steps, and in complex tri-modal video-image-text settings, where manual tuning becomes impractical, it improves average accuracy over uniform weights. Mixture computation overhead is negligible (< 1 GPU-hour), enabling scalable mixture design for modern VLM pipelines.
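The abstract refers to closed-form alignment scores obtained via a Fenchel dual, but does not reproduce the derivation. As a minimal, hypothetical sketch of the general mechanism, the snippet below shows how an entropy-regularized linear alignment objective over the probability simplex, max_w ⟨w, s⟩ − τ Σ_i w_i log w_i, admits a closed-form softmax maximizer that turns per-domain alignment scores into mixture weights. The function name `mixture_weights`, the temperature `tau`, and the example scores are illustrative assumptions, not MaD-Mix's published formulation.

```python
import numpy as np

def mixture_weights(align_scores: np.ndarray, tau: float = 1.0) -> np.ndarray:
    """Closed-form maximizer of  max_w <w, s> - tau * sum_i w_i log w_i
    over the simplex, i.e. a temperature-scaled softmax of the scores.
    This is a generic sketch, not the paper's actual closed form."""
    z = align_scores / tau
    z -= z.max()          # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()    # normalize onto the probability simplex

# Illustrative example: three domains (image-text, video-text, language-only)
# with made-up alignment scores; a domain with missing modalities could
# contribute a score computed only from the modalities it does have.
scores = np.array([0.8, 0.5, 0.3])
print(mixture_weights(scores, tau=0.5))
```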