Embodied decision-making enables agents to translate high-level goals into executable actions through continuous interaction with the physical world, forming a cornerstone of general-purpose embodied intelligence. Large language models (LLMs), with their general decision-making capabilities, offer a promising path to realizing this potential; however, LLMs trained solely on language lack exposure to physical environments, limiting their true embodied understanding. To bridge this gap, we propose the concept of a training ground: a comprehensive infrastructure that provides task and scene simulation, embodied interaction, and feedback signals, offering a one-stop solution for LLMs to acquire genuine embodied decision-making skills. In this work, we present EmboMatrix, the first training ground of its kind, providing massive and diverse tasks with efficient simulation and precise rewards. EmboMatrix incorporates a series of novel techniques: a multi-agent data engine for large-scale task and scene generation, a distributed heterogeneous-hardware system for scalable simulation, and a multi-level reward architecture for precise supervision. Leveraging EmboMatrix, we cultivate EmboBrain, an LLM whose embodied decision-making abilities emerge from extensive embodied interaction. Experiments show that EmboBrain-7B surpasses the 671B DeepSeek-R1 baseline by 9.5\% on two challenging embodied decision-making benchmarks, demonstrating the power of interactive, environment-grounded learning for building truly intelligent embodied agents.