Prevalent Vision-Language-Action (VLA) models are typically built upon Multimodal Large Language Models (MLLMs) and excel at semantic understanding, but they inherently lack the ability to infer physical-world dynamics. Consequently, recent approaches have shifted toward World Models, typically formulated as video prediction; however, these methods often lack semantic grounding and are brittle to prediction errors. To synergize semantic understanding with dynamic predictive capabilities, we present InternVLA-A1. The model employs a unified Mixture-of-Transformers architecture that coordinates three experts for scene understanding, visual foresight generation, and action execution; these components interact seamlessly through a unified masked self-attention mechanism. Building upon InternVL3 and Qwen3-VL, we instantiate InternVLA-A1 at 2B and 3B parameter scales. We pre-train these models on hybrid synthetic-real datasets spanning InternData-A1 and Agibot-World, covering over 533M frames. This hybrid training strategy effectively harnesses the diversity of synthetic simulation data while minimizing the sim-to-real gap. We evaluate InternVLA-A1 across 12 real-world robotic tasks and simulation benchmarks. It significantly outperforms leading models such as pi0 and GR00T N1.5, achieving a 14.5\% improvement on daily tasks and a 40\%-73.3\% boost in dynamic settings such as conveyor-belt sorting.
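To make the architecture concrete, below is a minimal, illustrative PyTorch sketch of one Mixture-of-Transformers layer in the spirit described above: three expert streams (scene understanding, visual foresight, action) keep separate projection and feed-forward weights but exchange information through a single masked self-attention pass over the concatenated token sequence. The module names (MoTLayer, ExpertFFN), dimensions, and mask pattern are assumptions for illustration, not the released InternVLA-A1 implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpertFFN(nn.Module):
    """Per-expert feed-forward block (each expert keeps its own weights)."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )

    def forward(self, x):
        return self.net(x)


class MoTLayer(nn.Module):
    """One Mixture-of-Transformers layer: expert-specific QKV/FFN weights,
    one shared masked self-attention over the concatenated token sequence."""

    def __init__(self, dim: int = 256, heads: int = 8, n_experts: int = 3):
        super().__init__()
        self.heads = heads
        self.qkv = nn.ModuleList([nn.Linear(dim, 3 * dim) for _ in range(n_experts)])
        self.out = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_experts)])
        self.ffn = nn.ModuleList([ExpertFFN(dim, 4 * dim) for _ in range(n_experts)])
        self.norm1 = nn.ModuleList([nn.LayerNorm(dim) for _ in range(n_experts)])
        self.norm2 = nn.ModuleList([nn.LayerNorm(dim) for _ in range(n_experts)])

    def forward(self, streams, attn_mask=None):
        # streams: list of [B, T_i, D] tensors, one per expert.
        # attn_mask: bool mask broadcastable to [B, heads, T, T] (True = attend).
        lengths = [s.shape[1] for s in streams]
        qs, ks, vs = [], [], []
        for i, s in enumerate(streams):
            q, k, v = self.qkv[i](self.norm1[i](s)).chunk(3, dim=-1)
            qs.append(q); ks.append(k); vs.append(v)
        # Concatenate all expert tokens and run one shared attention pass.
        q, k, v = torch.cat(qs, 1), torch.cat(ks, 1), torch.cat(vs, 1)
        B, T, D = q.shape
        q, k, v = (t.view(B, T, self.heads, D // self.heads).transpose(1, 2)
                   for t in (q, k, v))
        attn = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
        attn = attn.transpose(1, 2).reshape(B, T, D)
        # Split the result back per expert and apply expert-specific output paths.
        outs, start = [], 0
        for i, (s, L) in enumerate(zip(streams, lengths)):
            h = s + self.out[i](attn[:, start:start + L])
            outs.append(h + self.ffn[i](self.norm2[i](h)))
            start += L
        return outs


# Hypothetical usage: three token streams of different lengths share one layer.
layer = MoTLayer(dim=256, heads=8)
understanding = torch.randn(2, 32, 256)   # e.g. language/scene tokens
foresight = torch.randn(2, 16, 256)       # e.g. predicted-frame tokens
action = torch.randn(2, 8, 256)           # e.g. action-chunk tokens
outputs = layer([understanding, foresight, action])  # full mutual attention here
```

A block-structured boolean mask over the three token segments would control which experts attend to which, e.g. action tokens attending to all segments while understanding tokens attend only to themselves; this particular pattern is an assumption for illustration rather than the paper's stated design.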