Although an object may appear in numerous contexts, we often describe it in a limited number of ways. Language allows us to abstract away visual variation to represent and communicate concepts. Building on this intuition, we propose an alternative approach to visual representation learning: using language similarity to sample semantically similar image pairs for contrastive learning. Our approach diverges from image-based contrastive learning by sampling view pairs using language similarity instead of hand-crafted augmentations or learned clusters. Our approach also differs from image-text contrastive learning by relying on pre-trained language models to guide the learning rather than directly minimizing a cross-modal loss. Through a series of experiments, we show that language-guided learning yields better features than image-based and image-text representation learning approaches.
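As a minimal sketch of the sampling idea described above, the snippet below pairs each image with the image whose caption is most similar in a language embedding space. It rests on several assumptions that the abstract does not specify: each image comes with a caption, the caption embeddings are already computed by some pre-trained language model, cosine similarity is the similarity measure, and a single nearest neighbor is selected. The function names `cosine` and `sample_language_guided_pairs` and the toy vectors are hypothetical and used only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def sample_language_guided_pairs(caption_embeddings):
    """For each image index i, return (i, j), where j is the index of the
    image whose caption embedding is most similar to caption i (j != i).

    Assumption: caption_embeddings[i] is the embedding of the caption of
    image i, produced by a pre-trained language model outside this sketch.
    Requires at least two captions.
    """
    pairs = []
    for i, u in enumerate(caption_embeddings):
        best_j, best_sim = None, -math.inf
        for j, v in enumerate(caption_embeddings):
            if j == i:
                continue  # an image is never paired with itself
            sim = cosine(u, v)
            if sim > best_sim:
                best_j, best_sim = j, sim
        pairs.append((i, best_j))
    return pairs


# Toy stand-ins for caption embeddings (hypothetical values):
# captions 0 and 1 are close to each other, as are captions 2 and 3.
toy_embeddings = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
]
pairs = sample_language_guided_pairs(toy_embeddings)
print(pairs)
```

Under these assumptions, the resulting index pairs would serve as the positive pairs of a standard image contrastive objective, taking the place of two augmented views of the same image.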