Prevalent Vision-Language-Action (VLA) models are typically built upon Multimodal Large Language Models (MLLMs) and demonstrate exceptional proficiency in semantic understanding, but they inherently lack the capability to deduce physical-world dynamics. Consequently, recent approaches have shifted toward World Models, typically formulated via video prediction; however, these methods often lack semantic grounding and are brittle to video prediction errors. To synergize semantic understanding with dynamic predictive capability, we present InternVLA-A1. The model employs a unified Mixture-of-Transformers architecture that coordinates three experts for scene understanding, visual foresight generation, and action execution; these components interact seamlessly through a unified masked self-attention mechanism. Building upon InternVL3 and Qwen3-VL, we instantiate InternVLA-A1 at the 2B and 3B parameter scales. We pre-train these models on heterogeneous data sources spanning real-world robot data, synthetic simulation data, and human videos, covering over 692M frames. This hybrid training strategy effectively harnesses the diversity of synthetic simulation data while minimizing the sim-to-real gap. We evaluate InternVLA-A1 on 12 real-world robotic tasks and a simulation benchmark. The results show that InternVLA-A1 consistently outperforms prior leading models: compared with pi0.5, it achieves +4.4\% on static manipulation tasks, +2.6\% on the RoboTwin 2.0 simulation benchmark, and a +26.7\% boost on dynamic manipulation tasks.
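To make the "three experts, one unified masked self-attention" idea concrete, the following is a minimal NumPy sketch of a single Mixture-of-Transformers attention pass. Everything here is illustrative, not the paper's implementation: the token counts, dimensions, and in particular the block mask (each expert stream attending to itself and to earlier streams, so action tokens can read understanding and foresight tokens) are assumptions for the sketch. What is faithful to the abstract is the structure: per-expert projection weights (the "mixture"), with all three streams mixed in one shared masked self-attention computation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 16  # toy embedding width
lens = {"understand": 4, "foresight": 3, "action": 2}  # assumed toy token counts
order = ["understand", "foresight", "action"]

# Per-expert Q/K/V projections: each expert keeps its own parameters
# (the "mixture" part of Mixture-of-Transformers).
W = {name: {k: rng.standard_normal((d, d)) / np.sqrt(d) for k in "qkv"}
     for name in order}
tokens = {name: rng.standard_normal((n, d)) for name, n in lens.items()}

# Project each stream with its own expert weights, then concatenate so a
# single attention pass coordinates all three experts.
Q = np.concatenate([tokens[n] @ W[n]["q"] for n in order])
K = np.concatenate([tokens[n] @ W[n]["k"] for n in order])
V = np.concatenate([tokens[n] @ W[n]["v"] for n in order])

# Assumed block mask: each stream attends to itself and to earlier streams
# (action sees understanding + foresight; understanding sees only itself).
T = sum(lens.values())
starts = np.cumsum([0] + [lens[n] for n in order])
mask = np.full((T, T), -np.inf)
for i in range(len(order)):
    mask[starts[i]:starts[i + 1], :starts[i + 1]] = 0.0

attn = softmax(Q @ K.T / np.sqrt(d) + mask, axis=-1)
out = attn @ V  # one unified masked self-attention pass over all experts
print(out.shape)  # (9, 16)
```

In this layout the mask, not separate forward passes, is what keeps the experts coordinated: a single matrix multiply mixes information across streams wherever the mask permits, while each expert's own projections preserve its specialization.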