A primary topic in multimodal learning is how to jointly incorporate heterogeneous information from different modalities. However, most models suffer from unsatisfactory multimodal cooperation and cannot jointly utilize all modalities well. Some methods have been proposed to identify and enhance the worse-learnt modality, but they rarely provide a fine-grained, theoretically supported observation of multimodal cooperation at the sample level. It is therefore essential to reasonably observe and improve the fine-grained cooperation between modalities, especially in realistic scenarios where the modality discrepancy can vary across samples. To this end, we introduce a sample-level modality valuation metric that evaluates the contribution of each modality for each sample. Via modality valuation, we observe that the modality discrepancy indeed differs at the sample level, beyond the global contribution discrepancy at the dataset level. We further analyze this issue and improve cooperation between modalities at the sample level by enhancing the discriminative ability of low-contributing modalities in a targeted manner. Overall, our method reasonably observes fine-grained uni-modal contributions and achieves considerable improvement. The source code and dataset are available at \url{https://github.com/GeWu-Lab/Valuate-and-Enhance-Multimodal-Cooperation}.
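To make the valuation concrete, the following is a minimal sketch of one way such a sample-level metric can be realized: treating the modalities of a sample as players in a cooperative game and taking each modality's Shapley-style average marginal contribution to the prediction score. The \texttt{predict} interface, the masking strategy, and the two-modality (audio/visual) setup are illustrative assumptions for exposition, not the exact implementation in the released code.

\begin{verbatim}
from itertools import permutations

def score(predict, sample, present, label):
    # Score of the model when only the modalities in `present` are
    # available; absent modalities are assumed to be masked or zeroed
    # inside `predict` (an illustrative interface, not the paper's API).
    return float(predict(sample, present) == label)

def modality_valuation(predict, sample, label,
                       modalities=("audio", "visual")):
    # Per-sample contribution of each modality: the marginal gain in
    # score when that modality joins the available set, averaged over
    # all orderings of modalities -- i.e., the Shapley value of this
    # small coalition game.
    contrib = {m: 0.0 for m in modalities}
    orders = list(permutations(modalities))
    for order in orders:
        present = set()
        for m in order:
            before = score(predict, sample, frozenset(present), label)
            present.add(m)
            after = score(predict, sample, frozenset(present), label)
            contrib[m] += (after - before) / len(orders)
    return contrib
\end{verbatim}

Under this sketch, the low-contributing modality of a sample is simply \texttt{min(contrib, key=contrib.get)}, and the targeted enhancement described above (e.g., giving that modality extra training focus on exactly those samples) would operate on this per-sample signal.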