Vision-Language-Action (VLA) models have demonstrated strong performance across a wide range of robotic manipulation tasks. Despite this success, extending large pretrained Vision-Language Models (VLMs) to the action space can induce vision-action misalignment, where action predictions exhibit weak dependence on the current visual state, leading to unreliable action outputs. In this work, we study VLA models through the lens of visual conditioning and empirically show that successful rollouts consistently exhibit stronger visual dependence than failed ones. Motivated by this observation, we propose a training framework that explicitly strengthens visual conditioning in VLA models. Our approach first aligns action prediction with visual input via preference optimization on a track-following surrogate task, and then transfers the enhanced alignment to the instruction-following task through latent-space distillation during supervised finetuning. Without introducing architectural modifications or additional data collection, our method improves both visual conditioning and task performance for the discrete OpenVLA model, and further yields consistent gains when extended to the continuous OpenVLA-OFT setting. Project website: https://vista-vla.github.io/ .