A primary topic in multimodal learning is jointly incorporating heterogeneous information from different modalities. However, most models suffer from unsatisfactory multimodal cooperation and cannot jointly utilize all modalities well. Some methods have been proposed to identify and enhance the worse-learnt modality, but they can rarely provide a fine-grained, sample-level observation of multimodal cooperation with theoretical support. Hence, it is essential to reasonably observe and improve the fine-grained cooperation between modalities, especially in realistic scenarios where the modality discrepancy can vary across samples. To this end, we introduce a sample-level modality valuation metric to evaluate the contribution of each modality for each sample. Via modality valuation, we observe that the modality discrepancy can indeed differ at the sample level, beyond the global contribution discrepancy at the dataset level. We further analyze this issue and improve cooperation between modalities at the sample level by enhancing the discriminative ability of low-contributing modalities in a targeted manner. Overall, our method reasonably observes the fine-grained uni-modal contribution and achieves considerable improvement. The source code and dataset are available at https://github.com/GeWu-Lab/Valuate-and-Enhance-Multimodal-Cooperation.
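To make the idea of a sample-level modality valuation concrete, the sketch below computes a Shapley-style contribution for each modality of a single sample. This is an illustrative assumption, not the paper's exact formulation: the scoring function, modality names, and all numbers are hypothetical, standing in for the model's per-sample performance (e.g. correct-class confidence) under each subset of modalities.

```python
from itertools import chain, combinations
from math import factorial

def powerset(items):
    """All subsets of a collection of modalities."""
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))

def shapley_contributions(modalities, score):
    """Sample-level Shapley value of each modality.

    `score(subset)` returns the model's performance on one sample when
    only that subset of modalities is available.
    """
    n = len(modalities)
    values = {}
    for m in modalities:
        others = [x for x in modalities if x != m]
        total = 0.0
        for s in powerset(others):
            # Standard Shapley weight for a coalition of size |s|.
            w = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
            # Marginal gain of adding modality m to coalition s.
            total += w * (score(set(s) | {m}) - score(set(s)))
        values[m] = total
    return values

# Toy per-sample scores (purely illustrative, not from the paper):
# audio alone 0.2, visual alone 0.6, both 0.9, neither 0.1.
toy = {frozenset(): 0.1, frozenset({"audio"}): 0.2,
       frozenset({"visual"}): 0.6, frozenset({"audio", "visual"}): 0.9}
contrib = shapley_contributions(["audio", "visual"],
                                lambda s: toy[frozenset(s)])
# Here visual contributes more than audio for this sample, so a targeted
# enhancement would focus on the low-contributing audio modality.
```

Under this toy scoring, the two contributions sum to the gain of using all modalities over none (0.9 − 0.1), a standard efficiency property of Shapley values that makes per-sample attributions comparable across modalities.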