Vision-Language-Action (VLA) models have shown promise in robot manipulation but often struggle to generalize to new instructions or complex multi-task scenarios. We identify a critical pathology in current training paradigms: goal-driven data collection creates a dataset bias in which language instructions are highly predictable from visual observations alone, causing the conditional mutual information between instructions and actions to vanish, a phenomenon we term Information Collapse. Consequently, models degenerate into vision-only policies that ignore language constraints and fail in out-of-distribution (OOD) settings. To address this, we propose BayesianVLA, a novel framework that enforces instruction following via Bayesian decomposition. By introducing learnable Latent Action Queries, we construct a dual-branch architecture to estimate both a vision-only prior $p(a \mid v)$ and a language-conditioned posterior $\pi(a \mid v, \ell)$. We then optimize the policy to maximize the conditional Pointwise Mutual Information (PMI) between actions and instructions. This objective effectively penalizes the vision shortcut and rewards actions that explicitly explain the language command. Without requiring new data, BayesianVLA significantly improves generalization. Extensive experiments on SimplerEnv and RoboCasa demonstrate substantial gains, including an 11.3% improvement on the challenging OOD SimplerEnv benchmark, validating the ability of our approach to robustly ground language in action.
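The conditional PMI objective described above can be sketched in a few lines. This is a minimal pure-Python illustration, not the authors' implementation: it assumes actions are discretized into tokens, and the function names (`log_softmax`, `pmi`) and the two-logit setup are illustrative assumptions. PMI is the log-ratio of the language-conditioned posterior to the vision-only prior at the taken action; maximizing it rewards actions the instruction helps explain and penalizes the vision shortcut.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def pmi(action, posterior_logits, prior_logits):
    """Conditional pointwise mutual information for one action token:
    PMI(a; l | v) = log pi(a | v, l) - log p(a | v).
    posterior_logits come from the language-conditioned branch,
    prior_logits from the vision-only branch (names are illustrative)."""
    log_post = log_softmax(posterior_logits)[action]
    log_prior = log_softmax(prior_logits)[action]
    return log_post - log_prior
```

When the instruction shifts probability mass toward the taken action relative to the vision-only prior, PMI is positive; if the two branches agree (the instruction adds no information), PMI is zero, which is exactly the Information Collapse regime the abstract describes.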