A recent trend in vision-language models (VLMs) has been to enhance their spatial cognition for embodied domains. Despite progress, existing evaluations have been limited both in paradigm and in coverage, hindering rapid, iterative model development. To address these limitations, we propose ESPIRE, a diagnostic benchmark for embodied spatial reasoning. ESPIRE offers a simulated world that physically grounds VLMs and evaluates them on spatial-reasoning-centric robotic tasks, thus narrowing the gap between evaluation and real-world deployment. To adapt VLMs to robotic tasks, we decompose each task into localization and execution, and frame both as generative problems, in stark contrast to predominant discriminative evaluations (e.g., via visual-question answering) that rely on distractors and discard execution. This decomposition further enables a fine-grained analysis beyond passive spatial reasoning toward reasoning to act. We systematically design ESPIRE both at the instruction level and at the environment level, ensuring broad coverage of spatial reasoning scenarios. We use ESPIRE to diagnose a range of frontier VLMs and provide in-depth analysis of their spatial reasoning behaviors.