A central goal of multi-modal learning is to jointly incorporate heterogeneous information from different modalities. In practice, however, most models exhibit unsatisfactory multi-modal cooperation and fail to make good joint use of all modalities. Existing methods identify and enhance the worse-learnt modality, but they rarely provide a fine-grained, sample-level view of multi-modal cooperation that is backed by theory. It is therefore essential to observe and improve cooperation between modalities at a fine granularity, especially in realistic scenarios where the modality discrepancy can vary across samples. To this end, we introduce a fine-grained modality valuation metric that evaluates the contribution of each modality at the sample level. Using this metric, we observe that multi-modal models tend to rely on one specific modality, leaving the other modalities low-contributing. We further analyze this issue and improve cooperation between modalities by enhancing the discriminative ability of low-contributing modalities in a targeted manner. Overall, our method offers a reasonable way to observe fine-grained uni-modal contribution at the sample level and achieves considerable improvement on different multi-modal models.
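To make the idea of sample-level modality valuation concrete, the following is a minimal sketch, not the metric defined in this work. The abstract does not specify the metric's form, so the sketch assumes a Shapley-value style attribution, a standard game-theoretic way of splitting a joint payoff among players. The names `shapley_contributions`, `correctness_value`, and `toy_predict` are hypothetical, and the value function (1 if the prediction from a modality subset is correct, else 0) is likewise an assumption chosen for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_contributions(modalities, value_fn):
    """Shapley value of each modality for ONE sample.

    value_fn maps a frozenset of modality names to a real-valued payoff.
    """
    n = len(modalities)
    contributions = {}
    for m in modalities:
        others = [x for x in modalities if x != m]
        total = 0.0
        for k in range(len(others) + 1):
            # Standard Shapley weight for coalitions of size k.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal gain from adding modality m to coalition s.
                total += weight * (value_fn(s | {m}) - value_fn(s))
        contributions[m] = total
    return contributions


def correctness_value(predict, label):
    """Hypothetical payoff: 1.0 if the model is correct using `subset`, else 0.0."""
    def value_fn(subset):
        if not subset:
            return 0.0  # assumption: no input modalities means no payoff
        return 1.0 if predict(subset) == label else 0.0
    return value_fn


def toy_predict(subset):
    # Toy stand-in for a multi-modal model: it is only correct
    # (predicts class 1) when the audio modality is present.
    return 1 if "audio" in subset else 0


phi = shapley_contributions(["audio", "visual"], correctness_value(toy_predict, label=1))
print(phi)
```

In this toy sample the audio modality receives the entire contribution and the visual modality receives none, which mirrors the imbalance described above: the model relies on one modality while the other is low-contributing. Under the same assumptions, a per-sample score like this could be used to pick out which modality to strengthen for which samples.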