Reliable benchmarking is critical for advancing Vision-Language-Action (VLA) models, as it reveals their generalization, robustness, and the alignment of perception with language-driven manipulation. However, existing benchmarks often provide limited or misleading assessments because their evaluation protocols inadequately capture real-world distribution shifts. This work systematically rethinks VLA benchmarking from both the evaluation and data perspectives, introducing LIBERO-X, a benchmark featuring: 1) a hierarchical evaluation protocol with progressive difficulty levels targeting three core capabilities: spatial generalization, object recognition, and task instruction understanding. This design enables fine-grained analysis of performance degradation under increasing environmental and task complexity; 2) a high-diversity training dataset collected via human teleoperation, in which each scene supports multiple fine-grained manipulation objectives, bridging the train-evaluation distribution gap. Experiments with representative VLA models reveal significant performance drops under cumulative perturbations, exposing persistent limitations in scene comprehension and instruction grounding. By integrating hierarchical evaluation with diverse training data, LIBERO-X offers a more reliable foundation for assessing and advancing VLA development.