Grounded understanding of natural language in physical scenes can greatly benefit robots that follow human instructions. In object manipulation scenarios, existing end-to-end models are proficient at understanding semantic concepts, but typically cannot handle complex instructions involving spatial relations among multiple objects, which require both reasoning about object-level spatial relations and learning precise pixel-level manipulation affordances. We take an initial step toward this challenge with a decoupled two-stage solution. In the first stage, we propose an object-centric semantic-spatial reasoner that selects the objects relevant to the language-instructed task. The segmentation masks of the selected objects are then fused as additional input to the affordance learning stage. Simply incorporating the inductive bias of relevant objects into a vision-language affordance learning agent effectively boosts its performance on a custom testbed designed for object manipulation with language instructions that involve spatial relations.
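To make the data flow between the two stages concrete, the following is a minimal sketch rather than the method itself. It assumes that the first stage yields a relevance score per object, that objects above a threshold are selected, and that their masks are merged and appended to the image as one extra channel. The abstract does not specify the fusion mechanism, so channel concatenation, the threshold, and all names here (`select_relevant`, `union_masks`, `fuse_mask_channel`) are hypothetical.

```python
from typing import Dict, List, Sequence

Mask = List[List[int]]            # H x W binary segmentation mask
Image = List[List[List[float]]]   # H x W x C image


def select_relevant(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Stand-in for the stage-one reasoner: keep objects scoring at or above the threshold."""
    return [name for name, score in scores.items() if score >= threshold]


def union_masks(masks: Sequence[Mask]) -> Mask:
    """Merge per-object binary masks into a single mask of relevant pixels."""
    if not masks:
        raise ValueError("at least one mask is required")
    height, width = len(masks[0]), len(masks[0][0])
    return [
        [int(any(m[y][x] for m in masks)) for x in range(width)]
        for y in range(height)
    ]


def fuse_mask_channel(image: Image, mask: Mask) -> Image:
    """Append the mask to every pixel as one extra channel (assumed fusion scheme)."""
    return [
        [list(pixel) + [float(mask[y][x])] for x, pixel in enumerate(row)]
        for y, row in enumerate(image)
    ]


# Toy 2 x 2 RGB scene with three segmented objects (illustrative values only).
image = [[[0.0, 0.0, 0.0] for _ in range(2)] for _ in range(2)]
masks = {
    "red block": [[1, 0], [0, 0]],
    "blue bowl": [[0, 0], [0, 1]],
    "green block": [[0, 1], [0, 0]],
}
# Hypothetical relevance scores for "put the red block in the blue bowl".
scores = {"red block": 0.9, "blue bowl": 0.8, "green block": 0.1}

selected = select_relevant(scores)
fused = fuse_mask_channel(image, union_masks([masks[name] for name in selected]))
```

In this sketch the affordance learning stage would receive `fused`, a four-channel input in which the last channel marks the pixels of the selected objects, in place of the raw three-channel image.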