Visual imagery does not consist of solitary objects, but instead reflects the composition of a multitude of fluid concepts. While there have been great advances in visual representation learning, such advances have focused on building better representations for a small number of discrete objects, with little understanding of how those objects interact. This limitation is evident in representations learned from captions or through contrastive learning, where the model effectively treats an image as a bag of words. Several works have attempted to overcome this limitation by developing bespoke architectures that directly target compositional learning. In this work, we focus on simple, scalable approaches. In particular, we demonstrate that by substantially improving the quality of weakly labeled data, i.e., captions, we can vastly improve the performance of standard contrastive learning approaches. Previous CLIP models achieved near-chance performance on challenging benchmarks probing compositional learning. Our simple approach substantially boosts CLIP's performance, surpassing all bespoke architectures. Furthermore, we showcase our results on a relatively new captioning benchmark derived from DOCCI. Through a series of ablations, we show that a standard CLIP model trained with enhanced data achieves impressive performance on image retrieval tasks.
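For context, the "standard contrastive learning approaches" referenced above optimize a symmetric InfoNCE objective over paired image and caption embeddings. The following is a minimal sketch of that objective, not the paper's code; the function name, tensor shapes, and temperature value are illustrative assumptions.

```python
# Minimal sketch of the standard CLIP-style symmetric contrastive loss.
# All names and defaults here are illustrative, not taken from the paper.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    image_emb, text_emb: (batch, dim) outputs of the two encoders.
    Matching pairs share a row index; all other rows act as negatives.
    """
    # L2-normalize so the dot product equals cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix, scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature

    # The i-th image matches the i-th caption: targets lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```

Because every off-diagonal pair serves as a negative, the quality of the captions directly shapes what the model learns to distinguish, which is why the paper's data-improvement strategy can change performance without any architectural modification.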