Recent years have witnessed a significant increase in the performance of Vision and Language tasks. Foundational Vision-Language Models (VLMs), such as CLIP, have been leveraged in multiple settings and demonstrated remarkable performance across several tasks. Such models excel at object-centric recognition yet learn text representations that seem invariant to word order, failing to compose known concepts in novel ways. However, no evidence exists that any VLM, including large-scale single-stream models such as GPT-4V, identifies compositions successfully. In this paper, we introduce a framework to significantly improve the ability of existing models to encode compositional language, with over 10% absolute improvement on compositionality benchmarks, while maintaining or improving the performance on standard object-recognition and retrieval benchmarks. Our code and pre-trained models are publicly available at https://github.com/netflix/clove.
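The claim that CLIP-like text encoders are largely insensitive to word order can be probed directly. Below is a minimal sketch, not taken from the paper, that scores one image against two captions containing the same words in different orders using the Hugging Face `transformers` CLIP implementation; the checkpoint name, captions, and image path are illustrative assumptions.

```python
# Minimal word-order probe for a CLIP-style model (illustrative sketch).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("horse_eating_grass.jpg")  # hypothetical test image
captions = ["a horse eating grass", "grass eating a horse"]  # same words, different order

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores for the two captions;
# near-identical scores would indicate insensitivity to word order.
print(outputs.logits_per_image.softmax(dim=-1))
```

A compositionality-aware model should assign a clearly higher score to the caption whose word order matches the depicted scene.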