The ability to handle objects in cluttered environments has long been anticipated by the robotics community. However, most existing work focuses only on manipulation rather than on uncovering the hidden semantic information in cluttered objects. In this work, we introduce the scene graph for embodied exploration in cluttered scenarios to address this problem. To validate our method in cluttered scenarios, we adopt the Manipulation Question Answering (MQA) task as our test benchmark, which requires an embodied robot to have both active exploration ability and semantic understanding of vision and language. As a general solution framework for the task, we propose an imitation learning method to generate manipulations for exploration. Meanwhile, a visual question answering (VQA) model based on a dynamic scene graph is adopted to comprehend the series of RGB frames captured by the manipulator's wrist camera as each manipulation step is conducted, in order to answer questions in our framework. Experiments on the MQA dataset with different interaction requirements demonstrate that our proposed framework is effective for the MQA task, a representative task in cluttered scenarios.