Multimodal learning methods with targeted unimodal learning objectives have shown superior efficacy in alleviating the imbalanced multimodal learning problem. However, in this paper we identify a previously overlooked gradient conflict between the multimodal and unimodal learning objectives, which can mislead the optimization of the unimodal encoders. To diminish these conflicts, we observe a discrepancy between the multimodal loss and the unimodal loss: both the gradient magnitude and the gradient covariance of the easier-to-learn multimodal loss are smaller than those of the unimodal loss. Exploiting this property, we analyze Pareto integration in our multimodal scenario and propose the MMPareto algorithm, which ensures a final gradient whose direction is common to all learning objectives and whose magnitude is enhanced to improve generalization, providing innocent unimodal assistance. Experiments across multiple modality types and frameworks with dense cross-modal interaction demonstrate the superior and extensible performance of our method. Our method is also expected to benefit multi-task settings with a clear discrepancy in task difficulty, demonstrating its ideal scalability. The source code and dataset are available at https://github.com/GeWu-Lab/MMPareto_ICML2024.
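The two-objective Pareto integration mentioned above can be sketched as follows. This is a minimal illustration, not the authors' exact MMPareto implementation: it uses the classic closed-form minimum-norm convex combination for two gradients (the MGDA solution), whose result has a non-negative inner product with both gradients and hence a direction common to both objectives, and then rescales the magnitude. The specific rescaling rule (here, the average of the two gradient norms) is an assumption for illustration.

```python
import numpy as np

def pareto_combine(g_mm, g_uni, eps=1e-12):
    """Sketch of two-objective Pareto gradient integration.

    g_mm:  gradient of the multimodal loss (flattened vector)
    g_uni: gradient of the unimodal loss (flattened vector)

    Returns a vector whose direction is a descent direction for both
    objectives (when one exists), with its magnitude rescaled to the
    average of the two input gradient norms (an assumed rule).
    """
    diff = g_mm - g_uni
    denom = np.dot(diff, diff)
    if denom < eps:
        # Gradients (nearly) identical: any convex weight works.
        a = 0.5
    else:
        # Closed-form minimizer of ||a*g_mm + (1-a)*g_uni||^2 over [0, 1].
        a = np.clip(np.dot(g_uni - g_mm, g_uni) / denom, 0.0, 1.0)
    d = a * g_mm + (1.0 - a) * g_uni  # minimum-norm convex combination
    norm_d = np.linalg.norm(d)
    if norm_d < eps:
        # Exactly opposing gradients: no common descent direction exists.
        return np.zeros_like(d)
    # Keep the common direction, enhance the magnitude.
    target = 0.5 * (np.linalg.norm(g_mm) + np.linalg.norm(g_uni))
    return d * (target / norm_d)
```

In training, `g_mm` and `g_uni` would be the flattened gradients of the multimodal and unimodal losses with respect to a shared unimodal encoder's parameters, and the returned vector would replace their naive sum in the update step.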