Embodied intelligence now spans navigation, household assistance, manipulation, autonomous driving, aerial agents, and multimodal large-model control. This expansion has made benchmark construction a central bottleneck for reliable evaluation. Unlike static datasets, embodied benchmarks combine task specifications, environments, robot data, demonstrations, annotations, metrics, evaluation scripts, and release policies into a single evaluation system. This survey reviews the literature through a five-stage construction pipeline: requirement and task construction, data acquisition, data cleaning and annotation, benchmark suite generation and metric definition, and evaluation execution with diagnostic feedback. For each stage, the survey analyzes the transition from manual curation to traditional automation, foundation-model assistance, and agentic closed-loop workflows. It also compares qualitative construction costs across human labor, data and asset acquisition, compute and simulation, validation and debugging, governance and maintenance, and rework risk. The main conclusion is that automation does not simply reduce benchmark cost. Instead, it often shifts cost toward validation, auditability, version control, and long-term governance. Progress in embodied evaluation will therefore depend not only on larger benchmark suites, but also on construction pipelines that are diagnosable, auditable, and responsibly refreshable.