Robotic behavior synthesis, the problem of understanding multimodal inputs and generating precise physical control for robots, is an important part of Embodied AI. Despite the success of multimodal large language models at high-level understanding, translating this conceptual understanding into detailed robotic actions while generalizing across diverse scenarios remains challenging. In this paper, we propose a tree-structured multimodal code generation framework for generalized robotic behavior synthesis, termed RoboCodeX. RoboCodeX decomposes high-level human instructions into multiple object-centric manipulation units that incorporate physical preferences such as affordances and safety constraints, and leverages code generation to generalize across different robotics platforms. To further enhance the capability to map conceptual and perceptual understanding into control commands, a specialized multimodal reasoning dataset is collected for pre-training, and an iterative self-updating methodology is introduced for supervised fine-tuning. Extensive experiments demonstrate that RoboCodeX achieves state-of-the-art performance in both simulators and on real robots across four kinds of manipulation tasks and one navigation task.
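To make the decomposition idea concrete, the following is a minimal illustrative sketch of how a high-level instruction might be rendered as a tree of object-centric manipulation units carrying physical preferences. All names here (ManipulationUnit, build_task_tree, the example affordances and safety constraints) are hypothetical and are not RoboCodeX's actual API or generated code.

```python
# Minimal sketch (not the paper's API): a hard-coded example of decomposing
# "put the mug on the shelf" into object-centric units with physical preferences.
from dataclasses import dataclass, field


@dataclass
class ManipulationUnit:
    """One node of the task tree: a target object plus physical preferences."""
    target: str
    affordance: str                                   # e.g. preferred grasp/placement region
    safety: list = field(default_factory=list)        # e.g. force limits, keep-out zones
    children: list = field(default_factory=list)      # sub-units executed before this one


def build_task_tree(instruction: str) -> ManipulationUnit:
    """Return a hand-written task tree for the example instruction (demo only)."""
    grasp_mug = ManipulationUnit(
        target="mug",
        affordance="grasp by handle",
        safety=["limit gripper force"],
    )
    place_on_shelf = ManipulationUnit(
        target="shelf",
        affordance="place on free flat region",
        safety=["avoid collision with adjacent objects"],
        children=[grasp_mug],
    )
    return place_on_shelf


if __name__ == "__main__":
    root = build_task_tree("put the mug on the shelf")
    print(root.target, "<-", [child.target for child in root.children])
```

In the actual framework, such units would be produced by the multimodal model and compiled into platform-specific control code rather than hand-written as above.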