While multi-modal learning continues to grow rapidly, recent studies have exposed a deficiency in the standard joint training paradigm. These studies attribute the sub-optimal performance of jointly trained models to the modality competition phenomenon. Existing works attempt to improve the jointly trained model by modulating the training process. Despite their effectiveness, those methods apply only to late fusion models. More importantly, the mechanism of modality competition remains unexplored. In this paper, we first propose an adaptive gradient modulation method that can boost the performance of multi-modal models with various fusion strategies. Extensive experiments show that our method surpasses all existing modulation methods. Furthermore, to understand quantitatively both the modality competition and the mechanism behind the effectiveness of our modulation method, we introduce a novel metric to measure the competition strength. This metric is built on the mono-modal concept, a function designed to represent the competition-less state of a modality. Through systematic investigation, our results confirm the intuition that the modulation encourages the model to rely on the more informative modality. In addition, we find that the jointly trained model typically has a preferred modality, on which the competition is weaker than on the other modalities. However, this preferred modality need not dominate the others. Our code will be available at https://github.com/lihong2303/AGM_ICCV2023.