Vision-Language-Action (VLA) models achieve over 95% success on standard benchmarks. Through systematic experiments, however, we find that current state-of-the-art VLA models largely ignore language instructions. Prior work lacks (1) systematic semantic perturbation diagnostics, (2) a benchmark that forces language understanding by design, and (3) linguistically diverse training data. We construct the LangGap benchmark around a four-dimensional semantic perturbation method that varies instruction semantics while keeping the tabletop layout fixed, and use it to reveal language understanding deficits in π0.5. Existing benchmarks such as LIBERO assign only one task per layout, underutilizing the available objects and target locations; LangGap instead fully diversifies pick-and-place tasks under identical layouts, so a model must actually understand the instruction to succeed. Experiments show that targeted data augmentation can partially close the language gap: success rate improves from 0% to 90% with single-task training and from 0% to 28% with multi-task training. However, as the semantic diversity of the extended tasks increases, model learning capacity proves severely insufficient, and even trained tasks perform poorly. This reveals a fundamental challenge for VLA models in understanding diverse language instructions, which is precisely where the long-term value of LangGap lies.