Making a model accurately understand and follow natural language instructions while performing actions consistent with world knowledge is a key challenge in robot manipulation. This challenge mainly involves reasoning over ambiguous human instructions and adhering to physical knowledge. An embodied agent must therefore be able to model world knowledge from its training data. However, most existing vision-and-language robot manipulation methods operate in unrealistic simulator and language settings and lack explicit modeling of world knowledge. To bridge this gap, we introduce a novel and simple robot manipulation framework called Surfer. Built on a world model, Surfer treats robot manipulation as a state transition of the visual scene and decouples it into two parts: action and scene. Explicitly modeling both action and scene prediction over multi-modal information then improves the model's generalization to new instructions and new scenes. Beyond the framework, we also built a robot manipulation simulator that supports full physics execution based on the MuJoCo physics engine. It automatically generates demonstration training data and test data, substantially reducing labor costs. To evaluate robot manipulation models comprehensively and systematically in terms of both language understanding and physical execution, we further created a robot manipulation benchmark with progressive reasoning tasks, called SeaWave. It contains 4 levels of progressive reasoning tasks and provides a standardized testing platform for embodied AI agents in multi-modal environments. On average, Surfer achieved a success rate of 54.74% across the four defined levels of manipulation tasks, exceeding the best baseline's 47.64%.
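The action/scene decoupling described above can be illustrated with a minimal sketch: given an embedding of the current scene and the instruction, one head predicts the action, and a second head predicts the next scene conditioned on that action. All names, shapes, and the random linear layers below are illustrative assumptions, not the paper's actual architecture or API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embedding sizes: visual scene, instruction text, action vector.
SCENE_DIM, TEXT_DIM, ACTION_DIM = 8, 4, 3

def linear(in_dim, out_dim):
    """Random weight matrix standing in for a trained linear layer."""
    return rng.standard_normal((in_dim, out_dim)) * 0.01

W_action = linear(SCENE_DIM + TEXT_DIM, ACTION_DIM)   # action-prediction head
W_scene = linear(SCENE_DIM + ACTION_DIM, SCENE_DIM)   # scene-transition head

def world_model_step(scene, instruction):
    """One decoupled state transition.

    First predict the action from (scene, instruction), then predict the
    next scene from (scene, action) -- mirroring the action/scene split
    of the world-model view of manipulation.
    """
    action = np.tanh(np.concatenate([scene, instruction]) @ W_action)
    next_scene = np.tanh(np.concatenate([scene, action]) @ W_scene)
    return action, next_scene

scene = rng.standard_normal(SCENE_DIM)
instruction = rng.standard_normal(TEXT_DIM)
action, next_scene = world_model_step(scene, instruction)
print(action.shape, next_scene.shape)  # (3,) (8,)
```

Rolling `world_model_step` forward over a sequence of instructions yields a predicted trajectory of scenes, which is the signal the explicit world-knowledge modeling supervises.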