BayesianVLA: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries

Vision-Language-Action (VLA) models have shown promise in robot manipulation but often struggle to generalize to new instructions or complex multi-task scenarios. We identify a critical pathology in current training paradigms: goal-driven data collection creates a dataset bias. In such datasets, language instructions are highly predictable from visual observations alone, causing the conditional mutual information between instructions and actions to vanish, a phenomenon we term Information Collapse. Consequently, models degenerate into vision-only policies that ignore language constraints and fail in out-of-distribution (OOD) settings. To address this, we propose BayesianVLA, a novel framework that enforces instruction following via Bayesian decomposition. By introducing learnable Latent Action Queries, we construct a dual-branch architecture to estimate both a vision-only prior $p(a \mid v)$ and a language-conditioned posterior $\pi(a \mid v, \ell)$. We then optimize the policy to maximize the conditional Pointwise Mutual Information (PMI) between actions and instructions. This objective penalizes the vision shortcut and rewards actions that explicitly explain the language command. Without requiring new data, BayesianVLA significantly improves generalization. Extensive experiments on SimplerEnv and RoboCasa demonstrate substantial gains, including an 11.3% improvement on the challenging OOD SimplerEnv benchmark, validating the ability of our approach to robustly ground language in action.
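
To make the PMI objective concrete, the sketch below computes the conditional PMI, $\log \pi(a \mid v, \ell) - \log p(a \mid v)$, for a toy problem. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not give the loss, so the discrete three-action space, the softmax parameterization of both branches, and the names `log_softmax`, `conditional_pmi`, `prior_logits`, `shortcut_logits`, and `grounded_logits` are all hypothetical.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=-1, keepdims=True))


def conditional_pmi(posterior_logits, prior_logits, action):
    """PMI(a; l | v) = log pi(a | v, l) - log p(a | v) for one discrete action.

    posterior_logits: scores from the (assumed) language-conditioned branch.
    prior_logits:     scores from the (assumed) vision-only branch.
    """
    log_posterior = log_softmax(posterior_logits)[action]
    log_prior = log_softmax(prior_logits)[action]
    return float(log_posterior - log_prior)


# Toy setting: the vision-only prior favors action 0,
# while the instruction actually calls for action 1.
prior_logits = np.array([2.0, 0.0, 0.0])

# A posterior that ignores language simply copies the prior (the vision shortcut).
shortcut_logits = prior_logits.copy()

# A posterior that uses the instruction shifts probability mass to action 1.
grounded_logits = np.array([0.0, 2.0, 0.0])

pmi_shortcut = conditional_pmi(shortcut_logits, prior_logits, action=1)
pmi_grounded = conditional_pmi(grounded_logits, prior_logits, action=1)

print(f"PMI under the vision shortcut: {pmi_shortcut:.3f}")
print(f"PMI with language grounding:   {pmi_grounded:.3f}")
```

In this toy case the shortcut posterior yields a PMI of zero for every action, since the instruction adds no information beyond the image, whereas the grounded posterior yields a positive PMI for the instructed action. Maximizing PMI therefore gives no reward to a policy that merely reproduces the vision-only prior.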
