Thousand-GPU Large-Scale Training and Optimization Recipe for AI-Native Cloud Embodied Intelligence Infrastructure

Yongjian Guo, Yunxuan Ma, Haoran Sun, Zhong Guan, Shuai Di, Jing Long, Wanting Xu, Xiaodong Bai, Wen Huang, Yucheng Guo, Chen Zhou, Qiming Yang, Mingxi Luo, Tianyun Zhao, Hedan Yang, Song Wang, Xiaomeng Tian, Xiaolong Xiang, Zhen Sun, Yu Wei, Luqiao Wang, Yuzhen Li, Chenfeng Gu, Junwu Xiong, Yicheng Gong

Embodied intelligence is a key step toward Artificial General Intelligence (AGI), yet its development faces challenges in data, frameworks, infrastructure, and evaluation systems. To address these issues, we have launched the industry's first cloud-based, thousand-GPU distributed training platform for embodied intelligence, built on the widely adopted LeRobot framework, and have systematically addressed bottlenecks across the entire pipeline. At the data layer, we restructured the data pipeline to optimize the flow of embodied training data. At the training layer, for the GR00T-N1.5 model, using thousand-GPU clusters and data at the scale of hundreds of millions, single-round training time dropped from 15 hours to 22 minutes, a roughly 40-fold speedup. At the model layer, combining variable-length FlashAttention with data packing moves training from sample redundancy to sequence integration, yielding a 188% speed increase; attention optimization for π-0.5 accelerates training by 165%; and FP8 quantization delivers a 140% speedup. At the infrastructure layer, high-performance storage, a 3.2T RDMA network, and a Ray-driven elastic AI data lake provide deep synergy among data, storage, communication, and computation. We have also built an end-to-end evaluation system that closes the loop from training to simulation to assessment. The framework has been fully validated on thousand-GPU clusters, laying a crucial technical foundation for the development and application of next-generation autonomous intelligent robots, and is expected to accelerate the arrival of the era of human-machine integration.
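
To make the idea of pairing data packing with variable-length attention concrete, the following is a minimal sketch and not the platform's actual implementation. It assumes a greedy first-fit-decreasing packer (the abstract does not specify the packing strategy) and uses hypothetical helper names `pack_sequences` and `cu_seqlens`. The cumulative offsets it produces are the kind of boundary markers that variable-length attention kernels typically consume so that tokens from different samples in one packed row do not attend to each other.

```python
from typing import List


def pack_sequences(lengths: List[int], max_tokens: int) -> List[List[int]]:
    """Greedy first-fit-decreasing packing (an assumed strategy).

    Returns a list of bins; each bin holds the indices of the samples
    concatenated into one packed row of at most max_tokens tokens.
    """
    # Visit samples from longest to shortest.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins: List[List[int]] = []
    free: List[int] = []  # remaining token capacity of each bin
    for i in order:
        n = lengths[i]
        if n > max_tokens:
            raise ValueError(f"sample {i} has {n} tokens, above max_tokens")
        for b, room in enumerate(free):
            if n <= room:  # first bin with enough room
                bins[b].append(i)
                free[b] -= n
                break
        else:  # no existing bin fits: open a new one
            bins.append([i])
            free.append(max_tokens - n)
    return bins


def cu_seqlens(lengths: List[int], bin_indices: List[int]) -> List[int]:
    """Cumulative sequence boundaries for one packed row, starting at 0."""
    offsets = [0]
    for i in bin_indices:
        offsets.append(offsets[-1] + lengths[i])
    return offsets


if __name__ == "__main__":
    lengths = [5, 3, 8, 2, 6]  # illustrative token counts, not real data
    bins = pack_sequences(lengths, max_tokens=8)
    padded_slots = len(lengths) * max(lengths)  # one padded row per sample
    packed_slots = len(bins) * 8                # one row per bin
    print(bins, [cu_seqlens(lengths, b) for b in bins])
    print(f"padded slots: {padded_slots}, packed slots: {packed_slots}")
```

In this toy input, padding every sample to the longest length would allocate 40 token slots for 24 real tokens, whereas the packed layout allocates 24. This illustrates the shift from sample redundancy to sequence integration; it says nothing about the speedups reported above, which depend on the real data distribution and kernels.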
