Moving beyond the traditional paradigm of adapting internet-pretrained models to physical tasks, we present DM0, an Embodied-Native Vision-Language-Action (VLA) framework designed for Physical AI. Unlike approaches that treat physical grounding as a fine-tuning afterthought, DM0 unifies embodied manipulation and navigation by learning from heterogeneous data sources from the outset. Our methodology follows a comprehensive three-stage pipeline: Pretraining, Mid-Training, and Post-Training. First, we conduct large-scale unified pretraining of the Vision-Language Model (VLM) on diverse corpora, seamlessly integrating web text, autonomous driving scenarios, and embodied interaction logs, to jointly acquire semantic knowledge and physical priors. Subsequently, we build a flow-matching action expert atop the VLM. To reconcile high-level reasoning with low-level control, DM0 employs a hybrid training strategy: on embodied data, gradients from the action expert are not backpropagated into the VLM, preserving its generalized representations, while the VLM remains trainable on non-embodied data. Furthermore, we introduce an Embodied Spatial Scaffolding strategy that constructs spatial Chain-of-Thought (CoT) reasoning, effectively constraining the action solution space. Experiments on the RoboChallenge benchmark demonstrate that DM0 achieves state-of-the-art performance in both Specialist and Generalist settings on Table30.