Visual grounding (VG) aims to locate the foreground entities that match a given natural language expression. Previous datasets and methods for the classic VG task mainly rely on the assumption that the given expression literally refers to the target object, which greatly impedes the practical deployment of agents in real-world scenarios. Since users usually prefer to provide intention-based expressions for the desired object rather than describe all of its details, agents need to be able to interpret intention-driven instructions. Thus, in this work, we take a step toward intention-driven visual-language (V-L) understanding. To advance classic VG toward human intention interpretation, we propose a new intention-driven visual grounding (IVG) task and build a large-scale IVG dataset, termed IntentionVG, with free-form intention expressions. Because practical agents need to move and find specific targets across various scenarios to accomplish grounding, our IVG task and IntentionVG dataset account for two crucial properties: multi-scenario perception and an egocentric view. In addition, we set up various types of models as baselines for the IVG task. Extensive experiments on the IntentionVG dataset and baselines demonstrate the necessity and efficacy of our method for the V-L field. To foster future research in this direction, our newly built dataset and baselines will be publicly available at https://github.com/Rubics-Xuan/IVG.