Interactive Object Grasping (IOG) is the task of identifying and grasping a desired object through natural language interaction between a human and a robot. Current IOG systems assume that the human user initially specifies the target object's category (e.g., bottle). Inspired by pragmatics, where humans often rely on context to convey their intentions and achieve goals, we introduce a new IOG task, Pragmatic-IOG, and a corresponding dataset, Intention-oriented Multi-modal Dialogue (IM-Dial). In the proposed task scenario, the robot is first given an intention-oriented utterance (e.g., "I am thirsty") and must then identify the target object by interacting with the human user. Based on this task setup, we propose Pragmatic Object Grasping (PROGrasp), a new robotic system that interprets the user's intention and picks up the target object. PROGrasp performs Pragmatic-IOG by incorporating modules for visual grounding, question asking, object grasping, and, most importantly, answer interpretation for pragmatic inference. Experimental results show that PROGrasp is effective in both offline (i.e., target object discovery) and online (i.e., IOG with a physical robot arm) settings.