SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

Hongcheng Gao, Hailong Qu, Jingyi Tang, Jiahao Wang, Zihao Huang, Hengkang Qiao, Shihong Huang, Junming Yang, Yi Li, Hongyixuan Yuan, Wenjie Li, Bohan Zeng, Wenbo Li, Bo Wang, Jianhui Liu, Olive Huang, Haoyang Huang, Wentao Zhang, Guoqing Huang, Nan Duan, Yinpeng Dong

Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and operate within the physical world. However, existing benchmarks predominantly rely on passive evaluation (e.g., static VQA) or simulator-specific pipelines, failing to assess general interactive spatial understanding. We introduce SpatialWorld, a unified benchmark designed specifically for evaluating the interactive spatial understanding of multimodal agents in complex real-world tasks. Integrating eight heterogeneous simulation backends under a shared, simulator-agnostic protocol, SpatialWorld features 760 human-annotated tasks across diverse domains (e.g., household routines, travel, social collaboration). Agents must solve tasks under vision-only partial observability, actively gathering egocentric visual evidence and expressing decisions via a unified, text-based action interface native to MLLMs. For reliable evaluation, each task includes a human-validated initial state, a reference trajectory, and a terminal-state verifier. Evaluating 15 advanced agents reveals that robust spatial task solving remains challenging: the strongest model, GPT-5, achieves an average task success rate (TSR) of only 17.4%, while the leading open-source model, Qwen-3.5, reaches 14.1%. Further analysis exposes a clear mismatch between task success and execution efficiency, alongside substantial domain-specific performance variations. These bottlenecks in active exploration and long-horizon planning position SpatialWorld as a rigorous testbed for future spatial agents.
