Performant vision-language (VL) models like CLIP represent captions using a single vector. How much information about language is lost in this bottleneck? We first curate CompPrompts, a set of increasingly compositional image captions that VL models should be able to capture (e.g., from a single object, to an object with a property, to multiple interacting objects). Then, we train text-only recovery probes that aim to reconstruct captions from the single-vector text representations produced by several VL models. This approach does not require images, allowing us to test on a broader range of scenes than prior work. We find that: 1) CLIP's text encoder falls short on object relationships, attribute-object association, counting, and negation; 2) some text encoders work significantly better than others; and 3) text-only recovery performance predicts multi-modal matching performance on ControlledImCaps, a new evaluation benchmark we collect and release, consisting of fine-grained compositional images and captions. Specifically, our results suggest that text-only recoverability is a necessary (but not sufficient) condition for modeling compositional factors in contrastive vision-language models. We release our data and code.
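To make the "necessary condition" intuition concrete, the following is a minimal toy sketch, not the trained recovery probes described above. It assumes a hypothetical order-insensitive encoder (`bag_of_words_encode`, a stand-in we introduce for illustration, unrelated to CLIP's actual text encoder) and checks which captions could be recovered in principle: if two distinct captions map to the same vector, no probe, however powerful, can tell them apart, and neither can any image-matching step that only sees that vector.

```python
from collections import Counter


def bag_of_words_encode(caption: str) -> tuple:
    """Toy 'text encoder' that ignores word order.

    Hypothetical stand-in for illustration only; it is NOT CLIP's encoder.
    """
    counts = Counter(caption.lower().split())
    return tuple(sorted(counts.items()))


def recoverable(captions, encode):
    """Map each caption to True if no other distinct caption shares its vector.

    A collision means the caption cannot be uniquely reconstructed
    from its representation by any decoder.
    """
    buckets = {}
    for caption in captions:
        buckets.setdefault(encode(caption), set()).add(caption)
    return {c: len(buckets[encode(c)]) == 1 for c in captions}


# Captions of increasing compositionality (illustrative, not from CompPrompts).
captions = [
    "a dog",
    "a red dog",
    "a dog chases a cat",
    "a cat chases a dog",
]
result = recoverable(captions, bag_of_words_encode)
for caption, ok in result.items():
    print(f"{caption!r}: {'recoverable' if ok else 'collides'}")
```

In this toy setting the single-object and object-with-property captions remain recoverable, while the two relational captions collide because the encoder discards word order. The converse does not hold: a caption being recoverable from text alone says nothing about whether the image side of a contrastive model captures the same distinction, which is why recoverability is necessary but not sufficient.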