GIQ: Benchmarking 3D Geometric Reasoning of Vision Foundation Models with Simulated and Real Polyhedra

Modern monocular 3D reconstruction methods and vision-language models (VLMs) demonstrate impressive results on standard benchmarks, yet recent works cast doubt on their true understanding of geometric properties. We introduce GIQ, a comprehensive benchmark specifically designed to evaluate the geometric reasoning capabilities of vision and vision-language foundation models. GIQ comprises synthetic and real-world images and corresponding 3D meshes of diverse polyhedra covering varying levels of complexity and symmetry, from Platonic, Archimedean, Johnson, and Catalan solids to stellations and compound shapes. Through systematic experiments involving monocular 3D reconstruction, 3D symmetry detection, mental rotation tests, and zero-shot shape classification tasks, we reveal significant shortcomings in current models. First, state-of-the-art reconstruction algorithms trained on extensive 3D datasets struggle to accurately reconstruct even basic shapes such as the Platonic solids. Next, although linear and non-linear probing shows that foundation models capture specific 3D symmetry elements, they falter significantly in tasks requiring detailed geometric differentiation, such as mental rotation. Moreover, advanced vision-language assistants such as ChatGPT, Gemini, and Claude exhibit remarkably low accuracy in interpreting basic shape properties of complex polyhedra, such as face geometry, convexity, and compound structure. GIQ is publicly available at toomanymatts.github.io/giq-benchmark/, providing a structured platform to benchmark critical gaps in geometric intelligence and facilitate future progress in robust, geometry-aware representation learning.
