Building robots that can perceive, reason, and act in dynamic, unstructured environments remains a core challenge. Recent embodied systems often adopt a dual-system paradigm, in which System 2 handles high-level reasoning while System 1 executes low-level control. In this work, we refer to System 2 as the embodied brain, emphasizing its role as the cognitive core for reasoning and decision-making in manipulation tasks. Given this role, systematic evaluation of the embodied brain is essential. Yet existing benchmarks either emphasize execution success or, when targeting high-level reasoning, suffer from incomplete evaluation dimensions and limited task realism, offering only a partial picture of cognitive capability. To bridge this gap, we introduce RoboBench, a benchmark that systematically evaluates multimodal large language models (MLLMs) as embodied brains. Motivated by the critical roles the embodied brain plays across the full manipulation pipeline, RoboBench defines five dimensions (instruction comprehension, perception reasoning, generalized planning, affordance prediction, and failure analysis) spanning 14 capabilities, 25 tasks, and 6092 QA pairs. To ensure realism, we curate datasets across diverse embodiments, attribute-rich objects, and multi-view scenes, drawing from large-scale real robotic data. For planning, RoboBench introduces an evaluation framework, MLLM-as-world-simulator, which evaluates embodied feasibility by simulating whether predicted plans can achieve the critical object-state changes a task requires. Experiments on 14 MLLMs reveal fundamental limitations: difficulties with implicit instruction comprehension, spatiotemporal reasoning, cross-scenario planning, fine-grained affordance understanding, and execution-failure diagnosis. RoboBench provides a comprehensive scaffold to quantify high-level cognition and guide the development of next-generation embodied MLLMs. The project page is at https://robo-bench.github.io.
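The MLLM-as-world-simulator idea (judge a predicted plan feasible if simulating its steps produces the critical object-state changes the task requires) can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the action vocabulary, state encoding, and function names (`apply_step`, `plan_is_feasible`) are all hypothetical.

```python
def apply_step(state: dict, step: tuple) -> dict:
    """Apply one plan step (action, object, value) to a symbolic world state."""
    action, obj, value = step
    new_state = {k: dict(v) for k, v in state.items()}
    obj_state = new_state.setdefault(obj, {})
    if action == "move":
        obj_state["location"] = value          # e.g. ("move", "cup", "sink")
    elif action == "set":
        attr, val = value                      # e.g. ("set", "cup", ("filled", True))
        obj_state[attr] = val
    return new_state

def plan_is_feasible(initial: dict, plan: list, goal: dict) -> bool:
    """Simulate the plan step by step, then check that every critical
    object-state change demanded by the goal actually holds."""
    state = initial
    for step in plan:
        state = apply_step(state, step)
    return all(
        state.get(obj, {}).get(attr) == val
        for obj, attrs in goal.items()
        for attr, val in attrs.items()
    )

# Toy example: fill a cup at the sink.
initial = {"cup": {"location": "table", "filled": False}}
plan = [("move", "cup", "sink"), ("set", "cup", ("filled", True))]
goal = {"cup": {"location": "sink", "filled": True}}
print(plan_is_feasible(initial, plan, goal))  # True: the plan reaches the goal state
```

A plan that omits a required step (for example, never filling the cup) would fail the same check, which is the sense in which feasibility is judged by simulated state changes rather than surface-level plan wording.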