Enhancing Interpretability and Interactivity in Robot Manipulation: A Neurosymbolic Approach

In this paper we present a neurosymbolic architecture that couples language-guided visual reasoning with robot manipulation. A non-expert human user can prompt the robot in unconstrained natural language by providing a referring expression (REF), a question (VQA), or a grasp action instruction. The system handles all three cases in a task-agnostic fashion using a shared library of primitive skills. Each primitive handles an independent sub-task, such as reasoning about visual attributes, comprehending spatial relations, logic and enumeration, or arm control. A language parser maps the input query to an executable program composed of such primitives, depending on the context. While some primitives are purely symbolic operations (e.g. counting), others are trainable neural functions (e.g. visual grounding), thereby marrying the interpretability and systematic generalization benefits of discrete symbolic approaches with the scalability and representational power of deep networks. We generate a synthetic 3D vision-and-language dataset of tabletop scenes in a simulation environment to train our approach, and we perform extensive evaluations in both synthetic and real-world scenes. Results showcase the benefits of our approach in terms of accuracy, sample efficiency, and robustness to the user's vocabulary, and show that it transfers to real-world scenes with few-shot visual fine-tuning. Finally, we integrate our method with a robot framework and demonstrate how it can serve as an interpretable solution for an interactive object-picking task, both in simulation and with a real robot. We make our datasets available at https://gtziafas.github.io/neurosymbolic-manipulation.
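To make the idea of an executable program composed of primitives concrete, the following is a minimal sketch in Python and not the implementation described in the paper. All names here (`SceneObject`, `filter_color`, `relate_left_of`, `count`, `grasp`, `execute`) are hypothetical, the programs are written by hand rather than produced by a language parser, and the attribute filter is a plain symbolic lookup standing in for what would be a trainable neural grounding module. The execution trace illustrates how intermediate results of each primitive could be inspected.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class SceneObject:
    """Hypothetical symbolic description of one tabletop object."""
    name: str
    color: str
    x: float  # assumed convention: smaller x means further to the left


# Every primitive receives the full scene, the previous step's output,
# and an optional argument taken from the program.
Primitive = Callable[[List[SceneObject], Any, Optional[str]], Any]


def scene(objs, _state, _arg):
    """Start a program with all objects in the scene."""
    return list(objs)


def filter_color(_objs, state, arg):
    """Stand-in for neural visual grounding: keep objects of a given color."""
    return [o for o in state if o.color == arg]


def relate_left_of(objs, state, _arg):
    """Return the scene objects lying to the left of a unique anchor object."""
    if len(state) != 1:
        raise ValueError("spatial relation needs exactly one anchor object")
    anchor = state[0]
    return [o for o in objs if o.x < anchor.x]


def count(_objs, state, _arg):
    """Purely symbolic primitive: count the currently selected objects."""
    return len(state)


def grasp(_objs, state, _arg):
    """Stand-in for arm control: emit a command string instead of moving an arm."""
    if len(state) != 1:
        raise ValueError("grasp needs exactly one target object")
    return f"grasp({state[0].name})"


PRIMITIVES: Dict[str, Primitive] = {
    "scene": scene,
    "filter_color": filter_color,
    "relate_left_of": relate_left_of,
    "count": count,
    "grasp": grasp,
}

Program = List[Tuple[str, Optional[str]]]


def execute(program: Program, objs: List[SceneObject]):
    """Run primitives in order, recording every intermediate result."""
    state: Any = None
    trace = []
    for name, arg in program:
        state = PRIMITIVES[name](objs, state, arg)
        trace.append((name, arg, state))
    return state, trace


TABLE = [
    SceneObject("mug", "red", 0.1),
    SceneObject("bowl", "blue", 0.5),
    SceneObject("box", "red", 0.8),
]

# "How many red objects are there?" (a VQA-style query)
answer, trace = execute(
    [("scene", None), ("filter_color", "red"), ("count", None)], TABLE
)

# "Grasp the object to the left of the blue one." (a grasp instruction)
command, _ = execute(
    [("scene", None), ("filter_color", "blue"),
     ("relate_left_of", None), ("grasp", None)],
    TABLE,
)
```

In this sketch, both query types reuse the same `scene` and `filter_color` primitives, which mirrors the shared-library idea: only the final step (`count` versus `grasp`) differs between answering a question and issuing an action.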
