Vision-and-language models trained to match images with text can be combined with visual explanation methods to point to the locations of specific objects in an image. Our work shows that the localization, or "grounding," abilities of these models can be further improved by finetuning for self-consistent visual explanations. We propose a strategy for augmenting existing text-image datasets with paraphrases generated by a large language model, along with SelfEQ, a weakly-supervised objective that encourages self-consistency between the visual explanation maps of a phrase and its paraphrase. Specifically, for an input textual phrase, we generate a paraphrase and finetune the model so that the phrase and the paraphrase map to the same region in the image. We posit that this both expands the vocabulary the model is able to handle and improves the quality of the object locations highlighted by gradient-based visual explanation methods (e.g., GradCAM). We demonstrate that SelfEQ improves performance on Flickr30k, ReferIt, and RefCOCO+ over a strong baseline and several prior works. In particular, compared to other methods that do not use any form of box annotations, we obtain 84.07% on Flickr30k (an absolute improvement of 4.69%), 67.40% on ReferIt (an absolute improvement of 7.68%), and 75.10% and 55.49% on RefCOCO+ test sets A and B, respectively (an absolute improvement of 3.74% on average).
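The sketch below illustrates the core idea under stated assumptions; it is not the paper's implementation. It computes GradCAM-style explanation maps for a phrase and its paraphrase on the same image and adds an MSE term penalizing their disagreement, so both texts ground to the same region. `ToyVLModel` and all method names are hypothetical stand-ins for a real image-text matching model, and the consistency term would be combined with the model's base matching loss in practice.

```python
# Minimal sketch of a SelfEQ-style self-consistency objective (assumptions,
# not the authors' exact formulation). ToyVLModel is a hypothetical stand-in
# for a vision-and-language model trained to match images with text.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVLModel(nn.Module):
    """Stand-in image-text matching model with an inspectable visual feature map."""
    def __init__(self, dim=64):
        super().__init__()
        self.visual = nn.Conv2d(3, dim, kernel_size=8, stride=8)  # (B, dim, H/8, W/8)
        self.text = nn.EmbeddingBag(1000, dim)                    # toy bag-of-tokens encoder

    def visual_features(self, image):
        return self.visual(image)

    def similarity(self, feats, text_emb):
        pooled = feats.mean(dim=(2, 3))                           # global image embedding
        return F.cosine_similarity(pooled, text_emb, dim=-1)      # (B,)

def gradcam_map(model, feats, text_emb):
    """GradCAM-style map: channel weights come from the gradient of the
    image-text similarity w.r.t. the visual feature map."""
    score = model.similarity(feats, text_emb).sum()
    grads = torch.autograd.grad(score, feats, create_graph=True)[0]
    weights = grads.mean(dim=(2, 3), keepdim=True)                # per-channel importance
    cam = F.relu((weights * feats).sum(dim=1))                    # (B, H, W)
    return cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-6)     # normalize to [0, 1]

def selfeq_consistency_loss(model, image, phrase_tokens, paraphrase_tokens):
    """Encourage a phrase and its paraphrase to highlight the same image region."""
    feats = model.visual_features(image)
    cam_phrase = gradcam_map(model, feats, model.text(phrase_tokens))
    cam_paraphrase = gradcam_map(model, feats, model.text(paraphrase_tokens))
    return F.mse_loss(cam_phrase, cam_paraphrase)

# Usage: the consistency term is added to the base image-text matching loss.
model = ToyVLModel()
image = torch.randn(2, 3, 64, 64)
phrase = torch.randint(0, 1000, (2, 5))      # token ids for the original phrase
paraphrase = torch.randint(0, 1000, (2, 5))  # token ids for the LLM-generated paraphrase
loss = selfeq_consistency_loss(model, image, phrase, paraphrase)
loss.backward()  # second-order backprop through the explanation maps
```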