Recent Vision-Language-Action (VLA) models report impressive success rates on standard robotic benchmarks, fueling optimism about general-purpose physical intelligence. However, emerging evidence suggests a systematic misalignment between standard benchmark success and true embodied reasoning, raising the question of whether these high scores reflect genuine cognitive capability. To address this gap, we introduce BeTTER, a diagnostic Benchmark for Testing True Embodied Reasoning in robotic policies. BeTTER applies targeted causal interventions (e.g., spatial layout shifts, temporal extrapolation) while enforcing kinematic isolation to explicitly decouple high-level reasoning failures from low-level execution limits. Through systematic evaluation, we reveal that state-of-the-art VLAs fail catastrophically in dynamic scenarios, exhibiting severe lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse. Crucially, our mechanistic analysis traces these symptoms to fundamental architectural bottlenecks, such as capacity compression and myopic downsampling, which systematically degrade the model's foundational semantic representation. We demonstrate that highly static evaluation protocols effectively mask this degradation by allowing optimization to overfit to sensorimotor priors. Supported by real-world robotic validation, our findings confirm that this representational breakdown is not a simulation artifact, highlighting the critical need for future VLA paradigms to resolve the structural tension between high-frequency control and high-level reasoning.