In this work, we introduce HoloBrain-0, a comprehensive Vision-Language-Action (VLA) framework that bridges the gap between foundation model research and reliable real-world robot deployment. The core of our system is a novel VLA architecture that explicitly incorporates robot embodiment priors, including multi-view camera parameters and kinematic descriptions (URDF), to enhance 3D spatial reasoning and support diverse embodiments. We validate this design through a scalable ``pre-train then post-train'' paradigm, achieving state-of-the-art results on simulation benchmarks such as RoboTwin 2.0, LIBERO, and GenieSim, as well as strong performance on challenging long-horizon real-world manipulation tasks. Notably, our efficient 0.2B-parameter variant rivals significantly larger baselines, enabling low-latency on-device deployment. To further accelerate research and practical adoption, we fully open-source the entire HoloBrain ecosystem, which includes: (1) powerful pre-trained VLA foundations; (2) post-trained checkpoints for multiple simulation suites and real-world tasks; and (3) RoboOrchard, a full-stack VLA infrastructure for data curation, model training, and deployment. Together with standardized data collection protocols, this release provides the community with a complete, reproducible path toward high-performance robotic manipulation.