Although an object may appear in numerous contexts, we often describe it in a limited number of ways. This happens because language abstracts away visual variation to represent and communicate concepts. Building on this intuition, we propose an alternative approach to visual learning: using language similarity to sample semantically similar image pairs for contrastive learning. Our approach differs from image-based contrastive learning in that it uses language to sample pairs instead of hand-crafted augmentations or learned clusters. It also differs from image-text contrastive learning in that it relies on pre-trained language models to guide the learning rather than minimizing a cross-modal loss. Through a series of experiments, we show that language-guided learning can learn better features than both image-image and image-text representation learning approaches.
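The sampling idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`sample_language_guided_pairs`, `info_nce`), the nearest-neighbor pairing rule, the InfoNCE-style loss, the temperature value, and the toy embeddings are all assumptions made for illustration. The caption embeddings stand in for the output of some pre-trained language model, and the image features stand in for the output of the visual encoder being trained.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def sample_language_guided_pairs(caption_emb):
    """For each caption, return the index of the most similar *other* caption.

    caption_emb: (N, D) array of caption embeddings (assumed precomputed).
    """
    z = l2_normalize(caption_emb)
    sim = z @ z.T                    # pairwise cosine similarity
    np.fill_diagonal(sim, -np.inf)   # a caption may not be paired with itself
    return sim.argmax(axis=1)


def info_nce(anchor_feats, positive_feats, temperature=0.1):
    """InfoNCE-style contrastive loss between two sets of image features.

    Row i of positive_feats is the positive for row i of anchor_feats;
    all other rows of positive_feats act as negatives.
    """
    a = l2_normalize(anchor_feats)
    p = l2_normalize(positive_feats)
    logits = (a @ p.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


# Toy caption embeddings: captions 0 and 1 are close, as are captions 2 and 3.
caption_emb = np.array([
    [1.0, 0.0, 0.1],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.1],
    [0.1, 0.9, 0.0],
])
pairs = sample_language_guided_pairs(caption_emb)

# Stand-in image features; in practice these come from the visual encoder.
rng = np.random.default_rng(0)
image_feats = rng.normal(size=(4, 8))

# Use images 0 and 2 as anchors and their language-selected partners as positives.
anchors = np.array([0, 2])
positives = pairs[anchors]
loss = info_nce(image_feats[anchors], image_feats[positives])
```

In this sketch the image pairs are chosen purely from caption similarity, so the contrastive loss itself only ever compares image features with image features; language influences training through which pairs are sampled, not through a cross-modal term.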