SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?

Azmine Toushik Wasi, Wahid Faisal, Abdur Rahman, Mahfuz Ahmed Anik, Munem Shahriar, Mohsin Mahmud Topu, Sadia Tasnim Meem, Rahatun Nesa Priti, Sabrina Afroz Mitu, Md. Iqramul Hoque, Shahriyar Zaman Ridoy, Mohammed Eunus Ali, Majd Hawasly, Mohammad Raza, Md Rizwan Parvez

From arXiv. Accepted to ICLR 2026. 92 pages, 42 figures, and 29 tables.

Spatial reasoning is a fundamental aspect of human cognition, yet it remains a major challenge for contemporary vision-language models (VLMs). Prior work has largely relied on synthetic or LLM-generated environments with limited task designs and puzzle-like setups, which fail to capture the real-world complexity, visual noise, and diverse spatial relationships that VLMs encounter. To address this, we introduce SpatiaLab, a comprehensive benchmark for evaluating VLMs' spatial reasoning in realistic, unconstrained contexts. SpatiaLab comprises 1,400 visual question-answer pairs across six major categories: Relative Positioning, Depth & Occlusion, Orientation, Size & Scale, Spatial Navigation, and 3D Geometry. Each category has five subcategories, yielding 30 distinct task types. Each subcategory contains at least 25 questions and each main category at least 200, and the benchmark supports both multiple-choice and open-ended evaluation. Experiments across diverse state-of-the-art VLMs, including open-source, closed-source, reasoning-focused, and specialized spatial reasoning models, reveal a substantial gap in spatial reasoning capabilities compared with humans. In the multiple-choice setup, InternVL3.5-72B achieves 54.93% accuracy versus 87.57% for humans. In the open-ended setting, all models show a performance drop of around 10-25%, with GPT-5-mini scoring highest at 40.93% versus 64.93% for humans. These results highlight key limitations in handling complex spatial relationships, depth perception, navigation, and 3D geometry. By providing a diverse, real-world evaluation framework, SpatiaLab exposes critical challenges and opportunities for advancing VLMs' spatial reasoning, offering a benchmark to guide future research toward robust, human-aligned spatial understanding. SpatiaLab is available at: https://spatialab-reasoning.github.io/.
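As a quick check on the figures quoted above, the following minimal sketch computes the human-model accuracy gap in each setting and the number of task types implied by the taxonomy. It uses only numbers stated in the abstract. The dictionary layout and the function name `human_gap` are illustrative assumptions and are not part of the SpatiaLab release.

```python
# Accuracy figures (percent) as quoted in the abstract.
# The structure and names here are hypothetical, for illustration only.
REPORTED = {
    "multiple_choice": {"model": ("InternVL3.5-72B", 54.93), "human": 87.57},
    "open_ended": {"model": ("GPT-5-mini", 40.93), "human": 64.93},
}

# Taxonomy sizes stated in the abstract.
CATEGORIES = 6
SUBCATEGORIES_PER_CATEGORY = 5
TOTAL_QUESTIONS = 1400
MIN_QUESTIONS_PER_CATEGORY = 200


def human_gap(setting: str) -> float:
    """Return human accuracy minus model accuracy, in percentage points."""
    _, model_acc = REPORTED[setting]["model"]
    return round(REPORTED[setting]["human"] - model_acc, 2)


# Six categories with five subcategories each give the 30 task types.
task_types = CATEGORIES * SUBCATEGORIES_PER_CATEGORY

# The per-category minimum is consistent with the stated total:
# 6 * 200 = 1200, which does not exceed 1400.
min_total = CATEGORIES * MIN_QUESTIONS_PER_CATEGORY

if __name__ == "__main__":
    for setting in REPORTED:
        name, _ = REPORTED[setting]["model"]
        print(f"{setting}: human vs {name} gap = {human_gap(setting)} points")
    print(f"task types: {task_types}, minimum total questions: {min_total}")
```

The gaps are expressed in percentage points, so they describe absolute differences in accuracy rather than relative changes.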
