Beyond Literal Descriptions: Understanding and Locating Open-World Objects Aligned with Human Intentions

Visual grounding (VG) aims to locate the foreground entities that match a given natural language expression. Previous datasets and methods for the classic VG task mainly rely on the assumption that the given expression literally refers to the target object, which greatly impedes the practical deployment of agents in real-world scenarios. Because users usually prefer to give an intention-based expression for the desired object rather than describe all of its details, agents need to interpret intention-driven instructions. In this work, we therefore take a step toward intention-driven visual-language (V-L) understanding. To advance classic VG toward human intention interpretation, we propose a new intention-driven visual grounding (IVG) task and build a large-scale IVG dataset, termed IntentionVG, with free-form intention expressions. Because practical agents need to move and find specific targets across various scenarios to accomplish the grounding task, our IVG task and IntentionVG dataset take both multi-scenario perception and the egocentric view into consideration. In addition, various types of models are set up as baselines for the IVG task. Extensive experiments on our IntentionVG dataset and baselines demonstrate the necessity and efficacy of our method for the V-L field. To foster future research in this direction, our newly built dataset and baselines will be publicly available at https://github.com/Rubics-Xuan/IVG.
