Contrastive learning has emerged as an efficient framework for learning multimodal representations. CLIP, a seminal work in this area, achieved impressive results by training on paired image-text data with a contrastive loss. Recent work claims improvements over CLIP using additional non-contrastive losses inspired by self-supervised learning. However, it is sometimes hard to disentangle the contribution of these additional losses from that of other implementation details used to train the model, e.g., data augmentation or regularization techniques. To shed light on this matter, we first propose, implement, and evaluate several baselines obtained by combining contrastive learning with recent advances in self-supervised learning. In particular, we use loss functions that have proven successful for visual self-supervised learning to align the image and text modalities. We find that these baselines outperform a basic implementation of CLIP. However, when a stronger training recipe is employed, the advantage disappears. Indeed, we find that a simple CLIP baseline can also be improved substantially, by up to 25% (relative) on downstream zero-shot tasks, using well-known training techniques that are popular in other subfields. Moreover, we discover that applying image and text augmentations is enough to recover most of the improvement attained by prior works. With our improved training recipe for CLIP, we obtain state-of-the-art performance on four standard datasets and consistently outperform prior work (up to +4% on the largest dataset), while being substantially simpler.
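For readers unfamiliar with the contrastive loss referred to above, the following is a minimal sketch of a symmetric image-text contrastive (InfoNCE-style) objective of the kind CLIP is trained with. It is an illustration under stated assumptions, not the training code of this work: the function name `clip_contrastive_loss`, the use of NumPy, and the fixed `temperature=0.07` are all choices made here for the example.

```python
import numpy as np


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss for a batch of N paired embeddings.

    image_emb, text_emb: arrays of shape (N, D); row i of each is a matching pair.
    temperature: illustrative fixed value (an assumption for this sketch).
    """
    # L2-normalize so that dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # (N, N) similarity matrix; entry (i, j) compares image i with text j.
    logits = image_emb @ text_emb.T / temperature

    def cross_entropy_on_diagonal(scores):
        # Numerically stable log-softmax over each row.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        # The correct "class" for row i is column i (the matching pair).
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    loss_i2t = cross_entropy_on_diagonal(logits)
    loss_t2i = cross_entropy_on_diagonal(logits.T)
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(8, 16))
    # Perfectly aligned pairs should score a lower loss than mismatched pairs.
    print("aligned:   ", clip_contrastive_loss(emb, emb))
    print("mismatched:", clip_contrastive_loss(emb, np.roll(emb, 1, axis=0)))
```

In this sketch, each image is pulled toward its paired caption and pushed away from the other captions in the batch, and vice versa. The non-contrastive losses discussed above would be added on top of an objective of this form.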