Multimodal learning can complete the picture of information extraction by uncovering key dependencies between data sources. However, current systems fail to fully leverage multiple modalities for optimal performance, a shortfall attributed to modality competition: modalities compete for training resources, leaving some under-optimized. We show that current balancing methods struggle to train multimodal models that surpass even simple baselines, such as ensembles. This raises the question: how can we ensure that all modalities in multimodal training are sufficiently trained, and that learning from new modalities consistently improves performance? This paper proposes the Multimodal Competition Regularizer (MCR), a new loss component inspired by mutual information (MI) decomposition and designed to prevent the adverse effects of competition in multimodal training. Our key contributions are: 1) introducing game-theoretic principles into multimodal learning, where each modality acts as a player competing to maximize its influence on the final outcome, enabling automatic balancing of the MI terms; 2) refining lower and upper bounds for each MI term to improve the extraction of task-relevant unique and shared information across modalities; 3) proposing latent-space permutations for conditional MI estimation, significantly improving computational efficiency. MCR outperforms all previously suggested training strategies and is the first to consistently improve multimodal learning beyond the ensemble baseline, clearly demonstrating that combining modalities yields significant performance gains on both synthetic and large real-world datasets.