Learning good representations involves capturing the diverse ways in which data samples relate. Contrastive loss, an objective that matches related samples, underlies methods from self-supervised to multimodal learning. Contrastive losses, however, can be viewed more broadly as modifying a similarity graph to indicate how samples should relate in the embedding space. This view reveals a shortcoming in contrastive learning: the similarity graph is binary, as only one sample is treated as the related positive sample. Crucially, similarities \textit{across} samples are ignored. Based on this observation, we revise the standard contrastive loss to explicitly encode how a sample relates to others. We experiment with this new objective, called $\mathbb{X}$-Sample Contrastive, to train vision models based on similarities in class or text caption descriptions. Our study spans three scales: ImageNet-1k with 1 million, CC3M with 3 million, and CC12M with 12 million samples. The representations learned via our objective outperform both contrastive self-supervised and vision-language models trained on the same data across a range of tasks. When training on CC12M, we outperform CLIP by $0.6\%$ on both ImageNet and ImageNet Real. Our objective appears to work particularly well in lower-data regimes, with gains over CLIP of $16.8\%$ on ImageNet and $18.1\%$ on ImageNet Real when training with CC3M. Finally, our objective seems to encourage the model to learn representations that separate objects from their attributes and backgrounds, with gains of $3.3$--$5.6\%$ over CLIP on ImageNet9. We hope the proposed solution takes a small step towards developing richer learning objectives for understanding sample relations in foundation models.
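To make the similarity-graph view concrete, the following is a minimal sketch, not the exact $\mathbb{X}$-Sample Contrastive objective, which the abstract does not spell out. It assumes the loss is a cross-entropy between each row of a target similarity graph and a softmax over embedding similarities. The names `soft_contrastive_loss`, `target_graph`, and `temperature`, and the temperature value, are illustrative assumptions. With an identity target graph the sketch reduces to the usual one-positive contrastive loss; with a soft graph, related samples also receive target mass.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def softmax(row, temperature=1.0):
    """Numerically stable softmax of row / temperature."""
    scaled = [x / temperature for x in row]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_contrastive_loss(anchors, candidates, target_graph, temperature=0.1):
    """Mean cross-entropy between target rows and predicted similarity rows.

    target_graph[i][j] is the (assumed) target probability that anchor i
    relates to candidate j; each row should sum to 1. An identity matrix
    gives the standard binary, one-positive contrastive loss.
    """
    anchors = [normalize(a) for a in anchors]
    candidates = [normalize(c) for c in candidates]
    loss = 0.0
    for anchor, target_row in zip(anchors, target_graph):
        logits = [dot(anchor, c) for c in candidates]
        probs = softmax(logits, temperature)
        # Zero-weight entries are skipped; they contribute nothing.
        loss -= sum(t * math.log(p) for t, p in zip(target_row, probs) if t > 0)
    return loss / len(anchors)


# Toy embeddings: two orthogonal samples (hypothetical data).
embeddings = [[1.0, 0.0], [0.0, 1.0]]

# Binary graph: each sample is related only to its own positive.
binary_graph = [[1.0, 0.0], [0.0, 1.0]]

# Soft graph: each sample is also partly related to the other one,
# e.g. because their captions or class labels are similar.
soft_graph = [[0.8, 0.2], [0.2, 0.8]]

binary_loss = soft_contrastive_loss(embeddings, embeddings, binary_graph)
soft_loss = soft_contrastive_loss(embeddings, embeddings, soft_graph)
```

In this toy case the orthogonal embeddings fit the binary graph almost perfectly, so `binary_loss` is close to zero, while `soft_loss` is larger because the soft graph asks the two samples to be somewhat similar. In practice the soft targets might come from a row-wise softmax over caption or label similarities, which is again an assumption rather than a detail stated above.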