EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought

Embodied AI is a crucial frontier in robotics: it aims to plan and execute action sequences that let robots accomplish long-horizon tasks in physical environments. In this work, we introduce EmbodiedGPT, an end-to-end multi-modal foundation model for embodied AI that equips embodied agents with multi-modal understanding and execution capabilities. To this end, we make three contributions: (i) We build a large-scale embodied planning dataset, termed EgoCOT. The dataset consists of carefully selected videos from the Ego4D dataset, paired with high-quality language instructions. Specifically, we generate a sequence of sub-goals in a "chain-of-thought" style for effective embodied planning. (ii) We introduce an efficient training approach for EmbodiedGPT that yields high-quality plan generation by adapting a 7B large language model (LLM) to the EgoCOT dataset via prefix tuning. (iii) We introduce a paradigm for extracting task-related features from LLM-generated planning queries, forming a closed loop between high-level planning and low-level control. Extensive experiments show the effectiveness of EmbodiedGPT on embodied tasks, including embodied planning, embodied control, visual captioning, and visual question answering. Notably, EmbodiedGPT significantly improves the success rate on embodied control tasks by extracting more effective features: compared to the BLIP-2 baseline fine-tuned on the Ego4D dataset, it achieves a 1.6 times increase in success rate on the Franka Kitchen benchmark and a 1.3 times increase on the Meta-World benchmark.
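
The abstract names prefix tuning as the mechanism for adapting the 7B LLM but does not spell it out. As a rough illustration only, the toy sketch below shows the core idea for a single attention call: a small set of trainable key/value vectors is prepended to the frozen ones, so the attention output changes while the pre-trained parameters stay untouched. All names and numbers here (`attend_with_prefix`, `prefix_keys`, the 2-dimensional vectors) are hypothetical and chosen for illustration; this is not EmbodiedGPT's implementation, and real prefix tuning typically applies such prefixes inside the layers of a full transformer.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def attend_with_prefix(query, keys, values, prefix_keys, prefix_values):
    """Prefix tuning idea: prepend trainable key/value vectors.

    The frozen keys/values are only read, never modified.
    """
    return attend(query, prefix_keys + keys, prefix_values + values)


# Frozen activations standing in for the pre-trained model (toy numbers).
frozen_keys = [[1.0, 0.0], [0.0, 1.0]]
frozen_values = [[1.0, 0.0], [0.0, 1.0]]
query = [1.0, 0.0]

# The only "trainable" parameters in this sketch: one prefix key/value pair
# (hypothetical values; in practice they would be learned by gradient descent).
prefix_keys = [[2.0, 0.0]]
prefix_values = [[0.0, 5.0]]

base = attend(query, frozen_keys, frozen_values)
tuned = attend_with_prefix(query, frozen_keys, frozen_values,
                           prefix_keys, prefix_values)
print("without prefix:", base)
print("with prefix:   ", tuned)
```

Because only the prefix vectors would receive gradient updates, the number of trained parameters stays small relative to the frozen model, which is the usual motivation for calling this kind of adaptation efficient.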
