Do MLLMs Really Understand Space? A Mathematical Reasoning Evaluation

Shuo Lu, Jianjie Cheng, Yinuo Xu, Yongcan Yu, Lijun Sheng, Peijie Wang, Siru Jiang, Yongguan Hu, Run Ling, Yihua Shao, Ao Ma, Wei Feng, Lingxiao He, Meng Wang, Qianlong Xie, Xingxing Wang, Nicu Sebe, Ran He, Jian Liang

Multimodal large language models (MLLMs) have achieved strong performance on perception-oriented tasks, yet their ability to perform mathematical spatial reasoning, defined as the capacity to parse and manipulate two- and three-dimensional relations, remains unclear. Humans solve textbook-style spatial reasoning problems with over 95\% accuracy, but we find that most leading MLLMs fail to reach even 60\% on the same tasks. This gap marks spatial reasoning as a fundamental weakness of current models. To investigate it, we present \emph{MathSpatial}, the first large-scale, systematic dataset dedicated to mathematical spatial reasoning in MLLMs. \emph{MathSpatial} provides two complementary subsets: (i)~\emph{MathSpatial-Bench}, a rigorously curated evaluation set of 2{,}000 problems spanning 3 categories and 11 subtypes, designed to isolate spatial reasoning from perceptual noise; and (ii)~\emph{MathSpatial-Corpus}, a training set of 8{,}000 problems equipped with verified solutions and structured reasoning traces. All problems are sourced from authentic educational materials and undergo multi-stage quality control, including deduplication, geometric consistency checking, and cross-validated solution verification. Benchmarking 16 leading MLLMs on \emph{MathSpatial-Bench} shows that spatial reasoning remains a bottleneck: even GPT-5 lags behind human performance by over 35 percentage points, with particularly poor results on abstract deduction tasks. We further show that training on \emph{MathSpatial-Corpus} yields consistent improvements across model families, demonstrating the dataset's practical value for advancing spatial reasoning capabilities. \emph{MathSpatial} is publicly available at https://shuolucs.github.io/MathSpatial.
