Pre-trained multi-modal models such as CLIP provide transferable embeddings and show promising results across diverse applications. However, the learned multi-modal embeddings remain relatively unexplored, and their transferability can be improved. In this work, we observe that CLIP keeps the two modalities in separate embedding subspaces, and we investigate this through the lens of uniformity and alignment, which measure the quality of the learned representation. Both theoretically and empirically, we show that CLIP retains poor uniformity and alignment even after fine-tuning. Such a lack of alignment and uniformity may restrict the transferability and robustness of the embeddings. To this end, we devise a new fine-tuning method that yields robust representations with better alignment and uniformity. First, we propose Geodesic Multi-Modal Mixup, which mixes image and text embeddings to generate hard negative samples on the hypersphere. Then, we fine-tune the model with a contrastive loss on these hard negatives together with the original negatives and positives. We justify the method with a theoretical analysis of its hardness guarantee and limiting behavior. Extensive experiments on retrieval, calibration, few- or zero-shot classification (under distribution shift), embedding arithmetic, and image captioning further show that our method provides transferable representations, enabling robust model adaptation on diverse tasks. Code: https://github.com/changdaeoh/multimodal-mixup
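
As a rough illustration of mixing image and text embeddings on the hypersphere, the following minimal NumPy sketch uses standard spherical linear interpolation (slerp) between paired unit vectors and appends the mixed points as extra negatives in an InfoNCE-style loss. It is not taken from the linked repository: the function names, the weighting convention for `lam`, the temperature `tau`, and the exact way the mixed samples enter the loss are all assumptions made for this example.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Project each row of x onto the unit hypersphere."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def geodesic_mixup(a, b, lam, eps=1e-7):
    """Slerp between paired unit vectors (rows of a and b).

    Assumed convention: lam = 1 returns a, lam = 0 returns b.
    The result stays on the unit hypersphere.
    """
    cos = np.clip(np.sum(a * b, axis=-1, keepdims=True), -1.0 + eps, 1.0 - eps)
    theta = np.arccos(cos)  # angle between each pair
    w_a = np.sin(lam * theta) / np.sin(theta)
    w_b = np.sin((1.0 - lam) * theta) / np.sin(theta)
    return w_a * a + w_b * b


def contrastive_loss_with_mixed_negatives(img, txt, lam=0.5, tau=0.07):
    """Image-to-text InfoNCE with mixed embeddings added as negatives.

    Hypothetical loss for illustration: the positive for image i is text i;
    all other texts and all mixed embeddings act as negatives.
    """
    mixed = geodesic_mixup(img, txt, lam)
    logits = np.concatenate([img @ txt.T, img @ mixed.T], axis=1) / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(img.shape[0])
    return -log_prob[idx, idx].mean()


rng = np.random.default_rng(0)
img = l2_normalize(rng.normal(size=(4, 8)))  # stand-in image embeddings
txt = l2_normalize(rng.normal(size=(4, 8)))  # stand-in text embeddings
mixed = geodesic_mixup(img, txt, lam=0.3)
loss = contrastive_loss_with_mixed_negatives(img, txt)
```

Because a mixed point lies on the geodesic between an image embedding and its paired text embedding, it sits close to both, which is what makes it a plausible hard negative in this sketch.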