Existing Vision-Language Compositionality (VLC) benchmarks such as SugarCrepe are formulated as image-to-text retrieval problems, where, given an image, models must select between the correct textual description and a synthetic hard negative text. In this work, we present the Bidirectional Vision-Language Compositionality (BiVLC) dataset. The novelty of BiVLC lies in adding a synthetic hard negative image generated from the synthetic text, resulting in two image-to-text retrieval examples (one per image) and, more importantly, two text-to-image retrieval examples (one per text). Human annotators filter out ill-formed examples, ensuring the validity of the benchmark. The experiments on BiVLC uncover a weakness of current multimodal models: they perform poorly in the text-to-image direction. In fact, when both retrieval directions are considered, the conclusions drawn in previous work change significantly. In addition to the benchmark, we show that a contrastive model trained on synthetic images and texts significantly improves over the base model on SugarCrepe and on BiVLC in both retrieval directions. The gap to human performance in BiVLC confirms that Vision-Language Compositionality remains a challenging problem. BiVLC and code are available at https://imirandam.github.io/BiVLC_project_page.
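To make the bidirectional evaluation concrete, the sketch below scores one BiVLC quadruple (two images, two texts) with a CLIP-style contrastive model. It is a minimal illustration, assuming the open_clip library; the checkpoint name, the helper function, and the field layout of an instance are assumptions for illustration, not the paper's exact evaluation harness.

```python
import torch
from PIL import Image
import open_clip

# A CLIP-style contrastive model; any checkpoint with image/text encoders works.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def score_bivlc_instance(pos_img_path, neg_img_path, pos_text, neg_text):
    """Score one BiVLC quadruple: 2 images x 2 texts.

    Returns (i2t_correct, t2i_correct): whether the model solves both
    image-to-text choices and both text-to-image choices.
    """
    images = torch.stack(
        [preprocess(Image.open(p)) for p in (pos_img_path, neg_img_path)])
    texts = tokenizer([pos_text, neg_text])
    with torch.no_grad():
        img_feats = model.encode_image(images)
        txt_feats = model.encode_text(texts)
        img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
        txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
        sim = img_feats @ txt_feats.T  # sim[i, j]: image i vs. text j

    # Image-to-text: each image must prefer its own caption (row-wise).
    i2t = sim[0, 0] > sim[0, 1] and sim[1, 1] > sim[1, 0]
    # Text-to-image: each caption must prefer its own image (column-wise).
    t2i = sim[0, 0] > sim[1, 0] and sim[1, 1] > sim[0, 1]
    return bool(i2t), bool(t2i)
```

Under this framing, an instance counted as solved in the image-to-text direction may still fail in the text-to-image direction, which is exactly the asymmetry the benchmark is designed to expose.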