DamWorld: Progressive Reasoning with World Models for Robotic Manipulation

Research on embodied AI has greatly advanced robot manipulation, but significant challenges remain in benchmark construction, multi-modal perception and decision-making, and physical execution. Previous robot manipulation simulators were designed primarily to enrich the types of manipulation and objects, neglecting the balance between physical manipulation complexity and language instruction complexity in multi-modal environments. This paper proposes a new robot manipulation simulator and builds SeaWave, a comprehensive and systematic robot manipulation benchmark with progressive reasoning tasks. It provides a standard test platform for embodied AI agents in a multi-modal environment and can evaluate and execute four levels of human natural language instructions at the same time. Previous world-model-based robot manipulation work has lacked research on perception and decision-making for complex instructions in multi-modal environments. To this end, we propose DamWorld, a new world model tailored for cross-modal robot manipulation. Specifically, DamWorld takes as input the current visual scene and the execution actions predicted from natural language instructions, and uses the next action frame to supervise the world model's output, forcing the model to learn robot manipulation consistent with world knowledge. Compared with established baselines (e.g., RT-1), DamWorld improves the manipulation success rate by 5.6% on average across the four levels of progressive reasoning tasks. Notably, on the most challenging level 4 manipulation task, DamWorld still improves by 9.0% over prior work.
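The supervision signal described above (predict the next frame from the current scene and the predicted action, then penalize the difference from the observed next frame) can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration: frames and actions are short lists of floats, the "world model" is two scalar weights, the loss is mean squared error, and the names `predict_next_frame`, `next_frame_loss`, and `train_step` are hypothetical. It is not DamWorld's architecture or training code, and the action here is a fixed vector rather than the output of an instruction-conditioned policy.

```python
from typing import List, Tuple

Vector = List[float]


def predict_next_frame(frame: Vector, action: Vector,
                       w_frame: float, w_action: float) -> Vector:
    """Toy world model (assumption): an elementwise linear mix of frame and action."""
    return [w_frame * f + w_action * a for f, a in zip(frame, action)]


def next_frame_loss(predicted: Vector, target: Vector) -> float:
    """Mean squared error between the predicted and the observed next frame."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def train_step(frame: Vector, action: Vector, next_frame: Vector,
               w_frame: float, w_action: float,
               lr: float = 0.1) -> Tuple[float, float]:
    """One gradient-descent step on the next-frame loss for the two weights."""
    pred = predict_next_frame(frame, action, w_frame, w_action)
    n = len(next_frame)
    # Analytic gradients of the MSE with respect to each weight.
    grad_wf = sum(2 * (p - t) * f for p, t, f in zip(pred, next_frame, frame)) / n
    grad_wa = sum(2 * (p - t) * a for p, t, a in zip(pred, next_frame, action)) / n
    return w_frame - lr * grad_wf, w_action - lr * grad_wa


if __name__ == "__main__":
    frame = [1.0, 0.5, 0.0]       # stand-in for current visual features
    action = [0.0, 0.5, 1.0]      # stand-in for the predicted action
    next_frame = [1.0, 1.0, 1.0]  # observed next frame used as supervision
    w_f, w_a = 0.0, 0.0
    for _ in range(200):
        w_f, w_a = train_step(frame, action, next_frame, w_f, w_a)
    print(next_frame_loss(predict_next_frame(frame, action, w_f, w_a), next_frame))
```

In this toy setting the next frame is exactly the sum of the frame and the action, so repeated steps drive both weights toward 1 and the loss toward zero. A real world model would replace the two scalars with a learned network over image features.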
