Visual grounding (VG) aims to establish fine-grained alignment between vision and language. Ideally, it serves as a testbed for vision-and-language models, evaluating their understanding of images and texts and their ability to reason over the joint space. However, most existing VG datasets are built from simple descriptive texts that require little reasoning over images and texts. A recent study~\cite{luo2022goes} demonstrates this: a simple LSTM-based text encoder without pretraining can achieve state-of-the-art performance on mainstream VG datasets. In this paper, we therefore propose a novel benchmark, \underline{S}cene \underline{K}nowledge-guided \underline{V}isual \underline{G}rounding (SK-VG), in which the image content and referring expressions alone are not sufficient to ground the target objects, forcing models to reason over long-form scene knowledge. To perform this task, we propose two approaches that accept the triple-type input (image, scene knowledge, and query): the first embeds knowledge into the image features before the image-query interaction, and the second leverages linguistic structure to assist in computing image-text matching. We conduct extensive experiments to analyze these methods and show that they achieve promising results but still leave room for improvement in both performance and interpretability. The dataset and code are available at \url{https://github.com/zhjohnchan/SK-VG}.