Large vision-language models (VLMs) excel at multimodal understanding but fall short when extended to embodied tasks, where instructions must be transformed into low-level motor actions. We introduce ST4VLA, a dual-system Vision-Language-Action (VLA) framework that leverages Spatially Guided Training to align action learning with the spatial priors in VLMs. ST4VLA comprises two stages: (i) spatial grounding pre-training, which equips the VLM with transferable priors via scalable point, box, and trajectory prediction on both web-scale and robot-specific data, and (ii) spatially guided action post-training, which encourages the model to produce richer spatial priors that guide action generation via spatial prompting. This design preserves spatial grounding during policy learning and promotes consistent optimization across spatial and action objectives. Empirically, ST4VLA achieves substantial improvements over a vanilla VLA baseline, raising performance from 66.1 to 84.6 on the Google Robot and from 54.7 to 73.2 on the WidowX Robot, establishing new state-of-the-art results on SimplerEnv. It also demonstrates stronger generalization to unseen objects and paraphrased instructions, as well as robustness to long-horizon perturbations in real-world settings. These results highlight scalable, spatially guided training as a promising direction for robust, generalizable robot learning. Source code, data, and models are released at https://internrobotics.github.io/internvla-m1.github.io/