Multimodal learning aims to imitate the way humans acquire complementary information from multiple modalities for various downstream tasks. However, traditional aggregation-based multimodal fusion methods ignore inter-modality relationships, treat all modalities equally, and are susceptible to sensor noise, all of which degrade multimodal learning performance. In this work, we propose a novel multimodal contrastive method that learns more reliable multimodal representations under the weak supervision of unimodal prediction. Specifically, we first introduce a unimodal prediction task and use it to obtain task-related unimodal representations together with the corresponding unimodal predictions. The unimodal representations are then aligned with the more effective modality by the designed multimodal contrastive method, under the supervision of the unimodal predictions. Experimental results with fused features on two image-text classification benchmarks, UPMC-Food-101 and N24News, show that our proposed Unimodality-Supervised MultiModal Contrastive (UniS-MMC) learning method outperforms current state-of-the-art multimodal methods. A detailed ablation study and analysis further demonstrate the advantages of the proposed method.
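To make the idea of prediction-supervised alignment concrete, the following is a minimal sketch for a single image-text pair, not the authors' implementation. It assumes, purely for illustration, that a pair is treated as "positive" when both unimodal predictions are correct, "semi-positive" when exactly one is correct, and "negative" when both are wrong; the abstract does not spell out this categorization. The function names `cosine`, `pair_type`, and `alignment_loss`, and the specific loss terms, are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pair_type(pred_a, pred_b, label):
    """Categorize a pair by the correctness of its two unimodal predictions.

    Assumption for illustration only: this three-way split is not stated
    in the abstract.
    """
    ok_a = pred_a == label
    ok_b = pred_b == label
    if ok_a and ok_b:
        return "positive"
    if ok_a or ok_b:
        return "semi-positive"
    return "negative"


def alignment_loss(z_a, z_b, pred_a, pred_b, label):
    """Hypothetical per-sample alignment term between two unimodal embeddings."""
    sim = cosine(z_a, z_b)
    if pair_type(pred_a, pred_b, label) == "negative":
        # Both modalities are wrong: penalize positive similarity only,
        # so the two unreliable representations are not pulled together.
        return max(0.0, sim)
    # At least one modality is correct: pull the representations together.
    # In an autograd framework, the correct modality of a semi-positive
    # pair would typically be held fixed (stop-gradient) so that only the
    # weaker modality moves toward the more effective one.
    return 1.0 - sim


# Usage: the image prediction is correct (3), the text prediction is not (5),
# so the pair is semi-positive and the embeddings are pulled together.
loss = alignment_loss([1.0, 0.0], [0.0, 1.0], pred_a=3, pred_b=5, label=3)
```

In a training loop, a term of this kind would be averaged over the batch and added to the unimodal and multimodal classification losses; how the terms are weighted is a design choice that this sketch leaves open.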