Humans can imagine and manipulate visual images mentally, a capability known as spatial visualization. While many multi-modal benchmarks assess reasoning over visible visual information, the ability to infer unseen relationships through spatial visualization remains insufficiently evaluated as a spatial skill. Moreover, reliance on publicly sourced problems from IQ tests or math competitions risks data contamination and compromises assessment reliability. To this end, we introduce SpatialViz-Bench, a comprehensive multi-modal benchmark for spatial visualization comprising 1,180 programmatically generated problems spanning 12 tasks across 4 sub-abilities, built on a scalable generation framework that supports future expansion to ensure fair and continuously reliable evaluation. Our evaluation of 27 Multi-modal Large Language Models (MLLMs) reveals wide performance variation, demonstrates the benchmark's strong discriminative power, and uncovers a counter-intuitive finding: Chain-of-Thought (CoT) prompting paradoxically degrades accuracy on open-source models. Through statistical and qualitative analysis of error types, SpatialViz-Bench demonstrates that state-of-the-art MLLMs still exhibit deficiencies in spatial visualization, thereby addressing a significant gap in the field. The benchmark data and evaluation code are publicly available.