ERQA-Plus: A Diagnostic Benchmark for Reasoning in Embodied AI

Generalist embodied agents require more than object recognition: they must reason about spatial relations, actions, procedures, human intentions, environmental constraints, and commonsense consequences from situated visual observations. Yet existing visual and embodied question answering benchmarks often provide limited control over the reasoning dependencies being tested, making it difficult to distinguish grounded embodied reasoning from shortcut-driven visual or linguistic pattern matching. We present ERQA-Plus, a diagnostic benchmark for reasoning in embodied AI. ERQA-Plus contains 1,766 question-answer instances grounded in 711 robot-centric images and organized according to a structured taxonomy spanning perceptual, action-centric, social-interaction, navigation-environmental, and contextual commonsense reasoning. The dataset is constructed with a multi-stage generation and validation pipeline that combines taxonomy-guided question generation, automatic quality judging, iterative revision, and human assessment to improve visual grounding, answer validity, and reasoning quality. We benchmark representative general-purpose vision-language models and embodied models, including LLaVA-NeXT-8B, Prismatic-7B, MiniCPM-V-4.5-8B, Qwen3-VL, RoboRefer-8B, and RoboBrain2.5-8B. Although the strongest model, Qwen3-VL-32B, achieves 83.4% overall accuracy and an SBERT score of 61.4, category-level results reveal persistent weaknesses in spatial reasoning, procedural reasoning, event prediction, and intention inference. ERQA-Plus therefore provides a fine-grained evaluation framework for measuring not only whether embodied agents answer correctly, but also which forms of embodied reasoning they can and cannot perform reliably. The dataset is available at https://huggingface.co/datasets/huggingdas/erqa-plus and the project page is at https://github.com/LUNAProject22/erqa-plus.
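To make the distinction between overall and category-level results concrete, the following minimal Python sketch computes both from a list of scored predictions. It is an illustration under stated assumptions, not the benchmark's official evaluation code: the record keys `category`, `prediction`, and `answer` are hypothetical, and case-insensitive exact match is assumed as the correctness criterion, which the abstract does not specify.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return (overall_accuracy, {category: accuracy}).

    Each record is a dict with the hypothetical keys 'category',
    'prediction', and 'answer'. Correctness is assumed to be a
    case-insensitive exact match after stripping whitespace.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        category = record["category"]
        total[category] += 1
        predicted = record["prediction"].strip().lower()
        expected = record["answer"].strip().lower()
        if predicted == expected:
            correct[category] += 1

    # Per-category accuracy exposes weaknesses that the overall mean hides.
    per_category = {c: correct[c] / total[c] for c in total}
    n = sum(total.values())
    overall = sum(correct.values()) / n if n else 0.0
    return overall, per_category
```

A model can score well overall while still failing a small category, which is why the per-category dictionary is returned alongside the aggregate.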
