Multimodal Large Language Models (MLLMs) have achieved remarkable progress in visual recognition and semantic understanding. Nevertheless, their ability to perform precise, compositional spatial reasoning remains largely unexplored. Existing benchmarks typically involve relatively simple tasks and rely on semantic approximations or coarse relative positioning, and their evaluation metrics are often limited and lack rigorous mathematical formulation. To bridge this gap, we introduce TangramPuzzle, a geometry-grounded benchmark that evaluates compositional spatial reasoning through the lens of the classic Tangram game. To mitigate the ambiguity of visual approximation, we propose the Tangram Construction Expression (TCE), a symbolic geometric framework that grounds tangram assemblies in exact, machine-verifiable coordinate specifications. We design two complementary tasks: Outline Prediction, which requires inferring a global shape from local components, and End-to-End Code Generation, which requires solving the inverse geometric assembly problem. Extensive experiments on advanced open-source and proprietary models reveal an interesting insight: MLLMs tend to prioritize matching the target silhouette while neglecting geometric constraints, which leads to distortions or deformations of the pieces.
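Since the abstract emphasizes that TCE makes assemblies machine-verifiable through exact coordinates, the following minimal Python sketch illustrates what such a verification step could look like. It is an illustrative assumption, not the paper's actual TCE implementation: the function name, tolerance, and the use of the shapely library are all hypothetical, and the rigidity test shown (area preservation) is only a necessary condition rather than a full congruence check.

```python
# Minimal sketch (assumptions, not the paper's TCE format): a tangram solution
# expressed as exact polygons passes only if the pieces keep their size, do not
# overlap, and their union matches the target silhouette. Requires shapely.
from shapely.geometry import Polygon
from shapely.ops import unary_union

TOL = 1e-6  # numerical tolerance for area comparisons


def verify_assembly(pieces, reference_pieces, target):
    """pieces / reference_pieces: placed vs. canonical shapely Polygons;
    target: Polygon of the goal silhouette. All names are illustrative."""
    # 1. Rigidity (necessary condition): each placed piece preserves its
    #    canonical area; a full check would also compare edge lengths.
    for placed, canon in zip(pieces, reference_pieces):
        if abs(placed.area - canon.area) > TOL:
            return False, "piece distorted"
    # 2. Non-overlap: pairwise interiors must not intersect.
    for i in range(len(pieces)):
        for j in range(i + 1, len(pieces)):
            if pieces[i].intersection(pieces[j]).area > TOL:
                return False, "pieces overlap"
    # 3. Coverage: the union of the pieces must equal the target silhouette.
    union = unary_union(pieces)
    if union.symmetric_difference(target).area > TOL:
        return False, "silhouette mismatch"
    return True, "valid assembly"


# Usage example: two congruent right triangles assembled into a unit square.
t1 = Polygon([(0, 0), (1, 0), (0, 1)])
t2 = Polygon([(1, 1), (1, 0), (0, 1)])
square = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
print(verify_assembly([t1, t2], [t1, t2], square))  # (True, 'valid assembly')
```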