Vision-and-language models trained to match images with text can be combined with visual explanation methods to point to the locations of specific objects in an image. Our work shows that the localization, or "grounding," abilities of these models can be further improved by finetuning for self-consistent visual explanations. We propose a strategy for augmenting existing text-image datasets with paraphrases generated by a large language model, along with SelfEQ, a weakly-supervised objective that encourages self-consistency between the visual explanation maps of a phrase and its paraphrase. Specifically, for an input textual phrase, we generate a paraphrase and finetune the model so that the phrase and the paraphrase map to the same region in the image. We posit that this both expands the vocabulary the model is able to handle and improves the quality of the object locations highlighted by gradient-based visual explanation methods (e.g., GradCAM). We demonstrate that SelfEQ improves performance on Flickr30k, ReferIt, and RefCOCO+ over a strong baseline and several prior works. In particular, compared to other methods that do not use any form of box annotations, we obtain 84.07% on Flickr30k (an absolute improvement of 4.69%), 67.40% on ReferIt (an absolute improvement of 7.68%), and 75.10% and 55.49% on RefCOCO+ test sets A and B, respectively (an absolute improvement of 3.74% on average).
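The sketch below illustrates the core idea under stated assumptions; it is not the paper's implementation. It computes GradCAM-style explanation maps for a phrase and its paraphrase on the same image and adds an MSE term penalizing their disagreement, so both texts ground to the same region. `ToyVLModel` and all method names are hypothetical stand-ins for a real image-text matching model, and the consistency term would be combined with the model's base matching loss in practice.

```python
# Minimal sketch of a SelfEQ-style self-consistency objective (assumptions,
# not the authors' exact formulation). ToyVLModel is a hypothetical stand-in
# for a vision-and-language model trained to match images with text.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVLModel(nn.Module):
    """Stand-in image-text matching model with an inspectable visual feature map."""
    def __init__(self, dim=64):
        super().__init__()
        self.visual = nn.Conv2d(3, dim, kernel_size=8, stride=8)  # (B, dim, H/8, W/8)
        self.text = nn.EmbeddingBag(1000, dim)                    # toy bag-of-tokens encoder

    def visual_features(self, image):
        return self.visual(image)

    def similarity(self, feats, text_emb):
        pooled = feats.mean(dim=(2, 3))                           # global image embedding
        return F.cosine_similarity(pooled, text_emb, dim=-1)      # (B,)

def gradcam_map(model, feats, text_emb):
    """GradCAM-style map: channel weights come from the gradient of the
    image-text similarity w.r.t. the visual feature map."""
    score = model.similarity(feats, text_emb).sum()
    grads = torch.autograd.grad(score, feats, create_graph=True)[0]
    weights = grads.mean(dim=(2, 3), keepdim=True)                # per-channel importance
    cam = F.relu((weights * feats).sum(dim=1))                    # (B, H, W)
    return cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-6)     # normalize to [0, 1]

def selfeq_consistency_loss(model, image, phrase_tokens, paraphrase_tokens):
    """Encourage a phrase and its paraphrase to highlight the same image region."""
    feats = model.visual_features(image)
    cam_phrase = gradcam_map(model, feats, model.text(phrase_tokens))
    cam_paraphrase = gradcam_map(model, feats, model.text(paraphrase_tokens))
    return F.mse_loss(cam_phrase, cam_paraphrase)

# Usage: the consistency term is added to the base image-text matching loss.
model = ToyVLModel()
image = torch.randn(2, 3, 64, 64)
phrase = torch.randint(0, 1000, (2, 5))      # token ids for the original phrase
paraphrase = torch.randint(0, 1000, (2, 5))  # token ids for the LLM-generated paraphrase
loss = selfeq_consistency_loss(model, image, phrase, paraphrase)
loss.backward()  # second-order backprop through the explanation maps
```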