EmbodiedBrain: Expanding Performance Boundaries of Task Planning for Embodied Intelligence

Ding Zou, Feifan Wang, Mengyu Ge, Siyuan Fan, Zongbing Zhang, Wei Chen, Lingfeng Wang, Zhongyou Hu, Wenrui Yan, Zhengwei Gao, Hao Wang, Weizhao Jin, Yu Zhang, Hainan Zhao, Mingliang Zhang, Xianxian Xi, Yaru Zhang, Wenyuan Li, Zhengguang Gao, Yurui Zhu

The realization of Artificial General Intelligence (AGI) necessitates Embodied AI agents capable of robust spatial perception, effective task planning, and adaptive execution in physical environments. However, current large language models (LLMs) and multimodal LLMs (MLLMs) for embodied tasks suffer from key limitations, including a significant gap between model design and agent requirements, an unavoidable trade-off between real-time latency and performance, and the use of unauthentic, offline evaluation metrics. To address these challenges, we propose EmbodiedBrain, a novel vision-language foundation model available in both 7B and 32B parameter sizes. Our framework features an agent-aligned data structure and employs a powerful training methodology that integrates large-scale Supervised Fine-Tuning (SFT) with Step-Augmented Group Relative Policy Optimization (Step-GRPO), which boosts long-horizon task success by integrating preceding steps as Guided Precursors. Furthermore, we incorporate a comprehensive reward system, including a Generative Reward Model (GRM) accelerated at the infrastructure level, to improve training efficiency. To enable thorough validation, we establish a three-part evaluation system encompassing General, Planning, and End-to-End Simulation Benchmarks, highlighted by the proposal and open-sourcing of a novel, challenging simulation environment. Experimental results demonstrate that EmbodiedBrain achieves superior performance across all metrics, establishing a new state-of-the-art for embodied foundation models. To pave the way for the next generation of generalist embodied agents, we open-source all of our data, model weights, and evaluation methods, which are available at https://zterobot.github.io/EmbodiedBrain.github.io.

