Contrastive loss is increasingly used to learn representations from multiple modalities. In the limit, the contrastive loss encourages the modalities to match each other exactly in the latent space. Yet it remains an open question how modality alignment affects downstream task performance. In this paper, using an information-theoretic argument, we first prove that exact modality alignment is in general sub-optimal for downstream prediction tasks. We therefore argue that the key to better performance lies in meaningful latent modality structures rather than perfect modality alignment. To this end, we propose three general approaches to constructing latent modality structures. Specifically, we design 1) a deep feature separation loss for intra-modality regularization; 2) a Brownian-bridge loss for inter-modality regularization; and 3) a geometric consistency loss for both intra- and inter-modality regularization. We conduct extensive experiments on two popular multi-modal representation learning frameworks: the CLIP-based two-tower model and the ALBEF-based fusion model. We evaluate on a variety of tasks, including zero/few-shot image classification, image-text retrieval, visual question answering, visual reasoning, and visual entailment. Our method achieves consistent improvements over existing methods, demonstrating the effectiveness and generalizability of the proposed latent modality structure regularization.
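To make the alignment claim concrete, the following is a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) objective, not the paper's implementation. The function names, the temperature value of 0.07, and the toy embeddings are all illustrative assumptions. It shows that the loss falls toward zero as the paired embeddings of the two modalities coincide, which is the "exact alignment" limit the abstract refers to.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(img, txt, temperature=0.07):
    # Hypothetical helper: img and txt are (batch, dim) arrays where
    # row i of img is paired with row i of txt.
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # pairwise cosine similarities, scaled
    idx = np.arange(len(img))
    # Image-to-text: each row should put its mass on the matching column.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: each column should put its mass on the matching row.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)

# Toy data (assumed for illustration only).
rng = np.random.default_rng(0)
img = rng.normal(size=(8, 16))
txt_aligned = img.copy()               # text embeddings identical to image embeddings
txt_random = rng.normal(size=(8, 16))  # text embeddings unrelated to image embeddings

loss_aligned = symmetric_contrastive_loss(img, txt_aligned)
loss_random = symmetric_contrastive_loss(img, txt_random)
```

Here `loss_aligned` comes out far smaller than `loss_random`, so minimizing this objective alone pushes paired embeddings toward each other. The regularizers proposed in the abstract are meant to add structure on top of such an objective; they are not sketched here because the abstract does not give their formulas.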