Recently, with the rapid development of robot learning and imitation learning, numerous datasets and methods have emerged. However, these datasets and their task designs often lack systematic consideration and guiding principles. This raises important questions: Do current datasets and task designs truly advance the capabilities of robotic agents? Can evaluations on a handful of common tasks accurately reflect the differentiated performance of methods proposed by different teams and evaluated on different tasks? To address these issues, we introduce the Great March 100 (\textbf{GM-100}) as a first step toward a robot learning Olympics. GM-100 consists of 100 carefully designed tasks that cover a wide range of interactions and long-tail behaviors, providing a diverse and challenging benchmark for comprehensively evaluating the capabilities of robotic agents and promoting diversity and complexity in robot dataset task design. These tasks are developed through systematic analysis and expansion of existing task designs, combined with insights from human-object interaction primitives and object affordances. We collect a large amount of trajectory data on different robotic platforms and evaluate several baseline models. Experimental results demonstrate that the GM-100 tasks are 1) feasible to execute and 2) sufficiently challenging to effectively differentiate the performance of current vision-language-action (VLA) models. Our data and code are available at https://rhos.ai/research/gm-100.