Geometric spatial reasoning forms the foundation of many applications in artificial intelligence, yet the ability of large language models (LLMs) to operate over geometric spatial information expressed in procedural code remains underexplored. In this paper, we address this gap by formalizing the Program-to-Geometry task, which challenges models to translate programmatic drawing code into accurate and abstract geometric reasoning. To evaluate this capability, we present GeoGramBench, a benchmark of 500 carefully refined problems organized by a tailored three-level taxonomy that considers geometric complexity rather than traditional mathematical reasoning complexity. Our comprehensive evaluation of 17 frontier LLMs reveals consistent and pronounced deficiencies: even the most advanced models achieve less than 50% accuracy at the highest abstraction level. These results highlight the unique challenges posed by program-driven spatial reasoning and establish GeoGramBench as a valuable resource for advancing research in symbolic-to-spatial geometric reasoning. Project page: https://github.com/LiAuto-DSR/GeoGramBench.