Embodied reasoning systems integrate robotic hardware and cognitive processes to perform complex tasks, typically in response to a natural language query about a specific physical environment. This usually involves updating the system's belief about the scene or physically interacting with the scene and changing it (e.g., 'Sort the objects from lightest to heaviest'). To facilitate the development of such systems, we introduce a new simulation environment that uses the MuJoCo physics engine and the high-quality renderer Blender to provide realistic visual observations that are also accurate to the physical state of the scene. Together with the simulator, we propose a new benchmark composed of 10 classes of multi-step reasoning scenarios that require simultaneous visual and physical measurements. Finally, we develop a new modular Closed Loop Interactive Reasoning (CLIER) approach that takes into account measurements of non-visual object properties, changes in the scene caused by external disturbances, and the uncertain outcomes of robotic actions. We extensively evaluate our reasoning approach in simulation and in real-world manipulation tasks, with success rates above 76% and 64%, respectively.