Multimodal large language models (MLLMs) have achieved strong performance on perception-oriented tasks, yet their ability to perform mathematical spatial reasoning, defined as the capacity to parse and manipulate two- and three-dimensional relations, remains unclear. Humans solve textbook-style spatial reasoning problems with over 95\% accuracy, but we find that most leading MLLMs fail to reach even 60\% on the same tasks. This gap points to spatial reasoning as a fundamental weakness of current models. To investigate it, we present \emph{MathSpatial}, the first large-scale, systematic dataset dedicated to mathematical spatial reasoning in MLLMs. \emph{MathSpatial} provides two complementary subsets: (i)~\emph{MathSpatial-Bench}, a rigorously curated evaluation set of 2{,}000 problems spanning 3 categories and 11 subtypes, designed to isolate spatial reasoning from perceptual noise; and (ii)~\emph{MathSpatial-Corpus}, a training set of 8{,}000 problems equipped with verified solutions and structured reasoning traces. All problems are sourced from authentic educational materials and undergo multi-stage quality control, including deduplication, geometric consistency checking, and cross-validated solution verification. Benchmarking 16 leading MLLMs on \emph{MathSpatial-Bench} shows that spatial reasoning remains a bottleneck: even GPT-5 lags behind human performance by over 35 percentage points, with particularly poor results on abstract deduction tasks. We further show that training on \emph{MathSpatial-Corpus} yields consistent improvements across model families, demonstrating the dataset's practical value for advancing spatial reasoning capabilities. \emph{MathSpatial} is publicly available at https://shuolucs.github.io/MathSpatial.