Weakly supervised visual grounding aims to predict the image region that corresponds to a given linguistic query, where the mapping between the target object and the query is unknown during training. The state-of-the-art method uses a vision-language pre-trained model to obtain Grad-CAM heatmaps that match each query word to an image region, and uses the combined heatmap to rank region proposals. In this paper, we propose two simple but effective methods for improving this approach. First, we propose a target-aware cropping approach that encourages the model to learn both object-level and scene-level semantic representations. Second, we apply dependency parsing to extract the words related to the target object and emphasize these words when combining the heatmaps. Our method surpasses the previous SOTA methods on RefCOCO, RefCOCO+, and RefCOCOg by a notable margin.
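To make the heatmap combination and proposal ranking concrete, the following is a minimal sketch, not the method as implemented in the paper. It assumes that per-word heatmaps are already available (for example from Grad-CAM) and that the indices of target-related words have already been obtained from a dependency parser. The names `combine_heatmaps`, `rank_proposals`, and `emphasis`, the weighted-average scheme, and the mean-inside-box score are all illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Heatmap = List[List[float]]      # H x W grid of non-negative scores
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), with x1 and y1 exclusive


def combine_heatmaps(word_maps: Sequence[Heatmap],
                     target_idx: Sequence[int],
                     emphasis: float = 2.0) -> Heatmap:
    """Weighted average of per-word heatmaps.

    Words listed in target_idx get weight `emphasis`, all others weight 1.0.
    This weighting rule is an assumption made for illustration.
    """
    targets = set(target_idx)
    weights = [emphasis if i in targets else 1.0 for i in range(len(word_maps))]
    total = sum(weights)
    height, width = len(word_maps[0]), len(word_maps[0][0])
    return [[sum(wt * m[y][x] for wt, m in zip(weights, word_maps)) / total
             for x in range(width)]
            for y in range(height)]


def rank_proposals(heatmap: Heatmap, boxes: Sequence[Box]) -> List[int]:
    """Return proposal indices ordered by mean heatmap value inside each box."""
    def score(box: Box) -> float:
        x0, y0, x1, y1 = box
        vals = [heatmap[y][x] for y in range(y0, y1) for x in range(x0, x1)]
        return sum(vals) / len(vals) if vals else 0.0

    return sorted(range(len(boxes)), key=lambda i: score(boxes[i]), reverse=True)


# Toy query "dog near bench": word 0 ("dog") is the target, word 1 is context.
dog_map = [[0.8, 0.8, 0.0, 0.0],
           [0.8, 0.8, 0.0, 0.0]]
bench_map = [[0.0, 0.0, 1.0, 1.0],
             [0.0, 0.0, 1.0, 1.0]]
boxes = [(0, 0, 2, 2), (2, 0, 4, 2)]  # left proposal, right proposal

plain = combine_heatmaps([dog_map, bench_map], target_idx=[], emphasis=1.0)
weighted = combine_heatmaps([dog_map, bench_map], target_idx=[0], emphasis=2.0)

plain_ranking = rank_proposals(plain, boxes)        # context word dominates
weighted_ranking = rank_proposals(weighted, boxes)  # target word dominates
```

In this toy case the unweighted combination ranks the bench region first because its heatmap is stronger, while emphasizing the target word moves the dog region to the top. This is the kind of effect the target-word emphasis is meant to produce.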