Pre-trained multi-modal models such as CLIP provide transferable embeddings and show promising results in diverse applications. However, the learned multi-modal embeddings remain relatively unexplored, and their transferability can be improved. In this work, we observe that CLIP keeps the embeddings of its two modalities in separate subspaces, and we investigate this through the lens of uniformity and alignment to measure the quality of the learned representation. Both theoretically and empirically, we show that CLIP retains poor uniformity and alignment even after fine-tuning. This lack of alignment and uniformity might restrict the transferability and robustness of the embeddings. To this end, we devise a new fine-tuning method that yields robust representations with better alignment and uniformity. First, we propose Geodesic Multi-Modal Mixup, which mixes image and text embeddings to generate hard negative samples on the hypersphere. Then, we fine-tune the model with a contrastive loss on these hard negatives together with the original negatives and positives. We justify the method with a theoretical analysis of its hardness guarantee and limiting behavior. Extensive experiments on retrieval, calibration, few- or zero-shot classification (under distribution shift), embedding arithmetic, and image captioning further show that our method provides transferable representations, enabling robust model adaptation on diverse tasks. Code: https://github.com/changdaeoh/multimodal-mixup
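To make the two steps described in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation (see the linked repository for that). It assumes that geodesic mixing can be realized as spherical linear interpolation (slerp) between an L2-normalized image embedding and its paired text embedding, and that the mixed points are appended as extra negatives to an image-to-text InfoNCE-style loss. The function names, the temperature `tau`, and the choice to mix only matched pairs are illustrative assumptions.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Project the rows of x onto the unit hypersphere."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def geodesic_mixup(a, b, lam, eps=1e-7):
    """Slerp between unit-norm rows of a and b; lam is the weight on a.

    The result stays on the unit hypersphere, unlike linear mixup.
    """
    cos = np.clip(np.sum(a * b, axis=-1, keepdims=True), -1.0 + eps, 1.0 - eps)
    theta = np.arccos(cos)            # angle between each paired row
    sin_theta = np.sin(theta)
    w_a = np.sin(lam * theta) / sin_theta
    w_b = np.sin((1.0 - lam) * theta) / sin_theta
    return w_a * a + w_b * b


def contrastive_with_hard_negatives(img, txt, lam, tau=0.07):
    """Image-to-text contrastive loss with mixed embeddings as extra negatives.

    Assumption: for image i, the mixed embedding of every other pair j != i
    is treated as a hard negative, alongside the original in-batch texts.
    """
    mixed = geodesic_mixup(img, txt, lam)      # (n, d) mixed image/text points
    logits_orig = img @ txt.T / tau            # (n, n), positives on the diagonal
    logits_mix = img @ mixed.T / tau           # (n, n), similarities to mixed points
    np.fill_diagonal(logits_mix, -np.inf)      # a pair's own mixture is not a negative
    logits = np.concatenate([logits_orig, logits_mix], axis=1)
    # Numerically stable log-sum-exp over all candidates.
    m = logits.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    return float(np.mean(lse - np.diag(logits_orig)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = l2_normalize(rng.normal(size=(4, 8)))   # stand-ins for image embeddings
    txt = l2_normalize(rng.normal(size=(4, 8)))   # stand-ins for text embeddings
    print(contrastive_with_hard_negatives(img, txt, lam=0.5))
```

Slerp is used here rather than linear interpolation because a convex combination of two unit vectors generally falls inside the sphere, whereas the abstract specifies that the hard negatives lie on the hypersphere. A symmetric text-to-image term would be added in the same way.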