Moving beyond the traditional paradigm of adapting internet-pretrained models to physical tasks, we present DM0, an Embodied-Native Vision-Language-Action (VLA) framework designed for Physical AI. Unlike approaches that treat physical grounding as a fine-tuning afterthought, DM0 unifies embodied manipulation and navigation by learning from heterogeneous data sources from the outset. Our methodology follows a comprehensive three-stage pipeline: Pretraining, Mid-Training, and Post-Training. First, we conduct large-scale unified pretraining of the Vision-Language Model (VLM) on diverse corpora, seamlessly integrating web text, autonomous driving scenarios, and embodied interaction logs, to jointly acquire semantic knowledge and physical priors. Subsequently, we build a flow-matching action expert atop the VLM. To reconcile high-level reasoning with low-level control, DM0 employs a hybrid training strategy: on embodied data, gradients from the action expert are not backpropagated into the VLM, preserving its generalized representations, while the VLM remains trainable on non-embodied data. Furthermore, we introduce an Embodied Spatial Scaffolding strategy that constructs spatial Chain-of-Thought (CoT) reasoning, effectively constraining the action solution space. Experiments on the RoboChallenge benchmark demonstrate that DM0 achieves state-of-the-art performance in both Specialist and Generalist settings on Table30.