The evaluation of Vision-Language-Action (VLA) agents is hindered by coarse end-task success metrics, which provide neither precise skill diagnosis nor a measure of robustness to real-world perturbations. This challenge is exacerbated by a fragmented data landscape that impedes reproducible research and the development of generalist models. To address these limitations, we introduce NEBULA, a unified ecosystem for single-arm manipulation that enables diagnostic and reproducible evaluation. NEBULA features a novel dual-axis evaluation protocol that combines fine-grained capability tests for precise skill diagnosis with systematic stress tests that measure robustness. A standardized API and a large-scale, aggregated dataset are provided to reduce fragmentation and support cross-dataset training and fair comparison. Using NEBULA, we demonstrate that top-performing VLAs struggle with key capabilities such as spatial reasoning and dynamic adaptation, failures that conventional end-task success metrics consistently obscure. By measuring both what an agent can do and when it does so reliably, NEBULA provides a practical foundation for building robust, general-purpose embodied agents.