Humans can mentally imagine and manipulate visual images, a capability known as spatial visualization. While many multi-modal benchmarks assess reasoning over visible visual information, the ability to infer unseen relationships through spatial visualization remains insufficiently evaluated as a spatial skill. Moreover, existing benchmarks rely on publicly sourced problems from IQ tests or math competitions, which risks data contamination and compromises assessment reliability. To this end, we introduce SpatialViz-Bench, a comprehensive multi-modal benchmark for spatial visualization comprising 1,180 programmatically generated problems spanning 12 tasks across 4 sub-abilities. Because problems are generated programmatically, the framework is scalable and can be expanded to ensure fair and continually reliable evaluation. Our evaluation of 27 Multi-modal Large Language Models (MLLMs) reveals wide performance variation, demonstrating the benchmark's strong discriminative power, and uncovers a counter-intuitive finding: Chain-of-Thought (CoT) prompting paradoxically degrades accuracy on open-source models. Through statistical and qualitative analysis of error types, SpatialViz-Bench shows that state-of-the-art MLLMs exhibit clear deficiencies in spatial visualization, thereby addressing a significant gap in the field. The benchmark data and evaluation code are publicly available.