INVIGORATE: Interactive Visual Grounding and Grasping in Clutter

This paper presents INVIGORATE, a robot system that interacts with humans through natural language and grasps a specified object in clutter. The objects may occlude, obstruct, or even stack on top of one another. INVIGORATE addresses several challenges: (i) inferring the target object among other occluding objects from input language expressions and RGB images, (ii) inferring object blocking relationships (OBRs) from the images, and (iii) synthesizing a multi-step plan to ask questions that disambiguate the target object and to grasp it successfully. We train separate neural networks for object detection, for visual grounding, for question generation, and for OBR detection and grasping. They allow for unrestricted object categories and language expressions, subject to the training datasets. However, errors in visual perception and ambiguity in human language are inevitable and negatively impact the robot's performance. To overcome these uncertainties, we build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules. Through approximate POMDP planning, the robot tracks the history of observations and asks disambiguation questions in order to achieve a near-optimal sequence of actions that identify and grasp the target object. INVIGORATE combines the benefits of model-based POMDP planning and data-driven deep learning. Preliminary experiments with INVIGORATE on a Fetch robot show significant benefits of this integrated approach to object grasping in clutter with natural language interactions. A demonstration video is available at https://youtu.be/zYakh80SGcU.
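
As a rough illustration of the belief tracking described above, the following Python sketch keeps a probability distribution over which candidate object is the target and updates it with Bayes' rule after a yes/no answer to a disambiguation question. It is not the INVIGORATE model: the function names, the answer-reliability value of 0.9, the grasp threshold of 0.8, and the greedy ask-or-grasp rule are all assumptions made for illustration, and the greedy rule merely stands in for approximate POMDP planning.

```python
from typing import List, Tuple


def update_belief(belief: List[float], asked: int, answer_yes: bool,
                  p_correct: float = 0.9) -> List[float]:
    """Bayes update after asking "Do you mean object `asked`?".

    `p_correct` is an assumed probability that the answer is truthful
    and correctly understood (hypothetical value, not from the paper).
    """
    posterior = []
    for i, prior in enumerate(belief):
        # Probability of hearing "yes" if candidate i is the true target.
        p_yes = p_correct if i == asked else 1.0 - p_correct
        likelihood = p_yes if answer_yes else 1.0 - p_yes
        posterior.append(prior * likelihood)
    total = sum(posterior)
    return [p / total for p in posterior]


def choose_action(belief: List[float],
                  grasp_threshold: float = 0.8) -> Tuple[str, int]:
    """Greedy stand-in for planning: grasp when confident, otherwise ask."""
    best = max(range(len(belief)), key=lambda i: belief[i])
    if belief[best] >= grasp_threshold:
        return ("grasp", best)
    return ("ask", best)


# Initial belief over three candidates, e.g. from a visual grounding score.
belief = [0.5, 0.3, 0.2]
action = choose_action(belief)            # not confident yet, so ask
belief_after_yes = update_belief(belief, asked=0, answer_yes=True)
```

In this toy setting a single "yes" is enough to push the belief in the first candidate past the grasp threshold, while a "no" shifts probability mass to the remaining candidates and prompts another question.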
