Vision-Language-Action (VLA) models have achieved remarkable progress in embodied intelligence; however, their evaluation remains largely confined to simulations or highly constrained real-world settings. This mismatch creates a substantial reality gap, where strong benchmark performance often masks poor generalization in diverse physical environments. We identify three systemic shortcomings in current benchmarking practices that hinder fair and reliable model comparison. (1) Existing benchmarks fail to model real-world dynamics, overlooking critical factors such as dynamic object configurations, robot initial states, lighting changes, and sensor noise. (2) Current protocols neglect spatial--physical intelligence, reducing evaluation to rote manipulation tasks that do not probe geometric reasoning. (3) The field lacks scalable, fully autonomous evaluation, relying instead on simplistic 2D metrics that miss 3D spatial structure or on human-in-the-loop systems that are costly, biased, and unscalable. To address these limitations, we introduce RADAR (Real-world Autonomous Dynamics And Reasoning), a benchmark designed to systematically evaluate VLA generalization under realistic conditions. RADAR integrates three core components: (1) a principled suite of physical dynamics; (2) dedicated tasks that explicitly test spatial reasoning and physical understanding; and (3) a fully autonomous evaluation pipeline based on 3D metrics, eliminating the need for human supervision. We apply RADAR to audit multiple state-of-the-art VLA models and uncover severe fragility beneath their apparent competence. Performance drops precipitously under modest physical dynamics, with the expected 3D IoU declining from 0.261 to 0.068 under sensor noise. Moreover, models exhibit limited spatial reasoning capability. These findings position RADAR as a necessary step toward reliable and generalizable real-world evaluation of VLA models.
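The abstract reports results in terms of 3D IoU, the volumetric overlap between a predicted and a ground-truth region. The abstract does not specify how RADAR parameterizes these regions (e.g., oriented boxes or meshes), so the following is only a minimal illustrative sketch for the simplest case of axis-aligned 3D bounding boxes, each given as a `(min_xyz, max_xyz)` corner pair:

```python
import numpy as np

def iou_3d(box_a, box_b):
    """IoU of two axis-aligned 3D boxes, each a (min_xyz, max_xyz) pair.

    Illustrative only: RADAR's actual 3D metric may use a different
    region parameterization (e.g., oriented boxes).
    """
    min_a, max_a = (np.asarray(c, dtype=float) for c in box_a)
    min_b, max_b = (np.asarray(c, dtype=float) for c in box_b)
    # Overlap extent along each axis, clamped at zero when the boxes are disjoint.
    overlap = np.clip(np.minimum(max_a, max_b) - np.maximum(min_a, min_b), 0.0, None)
    inter = overlap.prod()
    vol_a = (max_a - min_a).prod()
    vol_b = (max_b - min_b).prod()
    union = vol_a + vol_b - inter
    return float(inter / union) if union > 0 else 0.0
```

For example, a unit cube and the same cube shifted by 0.5 along one axis overlap in a volume of 0.5 against a union of 1.5, giving an IoU of 1/3; averaging this score over trials yields the kind of expected 3D IoU figure quoted above.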