Vision-Language-Action models (VLAs) promise to ground language instructions in robot control, yet in practice they often fail to follow language faithfully. When presented with instructions that lack strong scene-specific supervision, VLAs suffer from counterfactual failures: they act on visual shortcuts induced by dataset biases, repeatedly executing well-learned behaviors and selecting objects frequently seen during training, regardless of language intent. To study this phenomenon systematically, we introduce LIBERO-CF, the first counterfactual benchmark for VLAs, which evaluates language-following capability by assigning alternative instructions under visually plausible LIBERO layouts. Our evaluation reveals that counterfactual failures are prevalent yet underexplored across state-of-the-art VLAs. We propose Counterfactual Action Guidance (CAG), a simple yet effective dual-branch inference scheme that explicitly regularizes language conditioning in VLAs. CAG combines a standard VLA policy with a language-unconditioned Vision-Action (VA) module, enabling counterfactual comparison during action selection. This design reduces reliance on visual shortcuts, improves robustness on under-observed tasks, and requires neither additional demonstrations nor modifications to existing architectures or pretrained models. Extensive experiments demonstrate plug-and-play integration across diverse VLAs with consistent improvements. For example, on LIBERO-CF, CAG improves $π_{0.5}$ by 9.7% in language-following accuracy and 3.6% in task success on under-observed tasks using a training-free strategy, with further gains of 15.5% and 8.5%, respectively, when paired with a VA model. In real-world evaluations, CAG reduces counterfactual failures by 9.4% and improves task success by 17.2% on average.
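The dual-branch inference scheme can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes both branches emit continuous action vectors and combines them by extrapolating away from the language-free prediction, in the spirit of classifier-free guidance. The policy stand-ins, the guidance scale `w`, and all function names are hypothetical.

```python
import numpy as np

def vla_policy(obs, instruction):
    # Stand-in for a language-conditioned VLA policy; returns an action vector.
    # (A real VLA would run a vision-language model; this is a toy placeholder.)
    return np.tanh(obs + 0.01 * len(instruction))

def va_policy(obs):
    # Stand-in for the language-unconditioned Vision-Action (VA) branch.
    return np.tanh(obs)

def cag_action(obs, instruction, w=1.5):
    """Hypothetical counterfactual guidance rule:
    a_cag = a_va + w * (a_vla - a_va),
    amplifying the part of the action that depends on the instruction."""
    a_vla = vla_policy(obs, instruction)
    a_va = va_policy(obs)
    return a_va + w * (a_vla - a_va)

obs = np.zeros(7)  # e.g., a 7-DoF action-space placeholder
action = cag_action(obs, "pick up the red mug")
print(action.shape)  # (7,)
```

With `w = 1` the rule reduces to the plain VLA output; `w > 1` pushes the action away from what the vision-only branch would do, which is one plausible way to suppress shortcut behaviors that ignore the instruction.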