Humans excel at spatial reasoning tasks such as Tangram puzzle assembly through cognitive processes involving mental rotation, iterative refinement, and visual feedback. Inspired by how humans solve Tangram puzzles through trial and error, observation, and correction, we design a framework that models these cognitive mechanisms. However, comprehensive experiments across five representative Vision-Language Models (VLMs) reveal systematic failures in continuous geometric reasoning: an average IoU of only 0.41 on single-piece tasks, dropping to 0.23 on two-piece composition, far below human performance, as even children can complete Tangram tasks successfully. This paper addresses a fundamental challenge in self-improving AI: can models iteratively refine their predictions at test time without parameter updates? We introduce a test-time self-refinement framework that combines in-context learning (ICL) with reward-guided feedback loops. Our training-free verifier-refiner agent applies a recursive refinement loop that iteratively revises predictions based on geometric-consistency feedback, improving IoU from 0.63 to 0.932 on medium-triangle cases without any model retraining. This demonstrates that incorporating human-inspired iterative refinement through ICL and reward loops can substantially enhance geometric reasoning in VLMs, moving self-improving AI from promise to practice in continuous spatial domains. Our work is available anonymously at https://anonymous.4open.science/r/TangramVLM-F582/.
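The verifier-refiner loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `propose` and `refine` callables stand in for the VLM's placement proposal and its feedback-conditioned revision, the masks are boolean occupancy grids, and the iteration budget and stopping threshold are hypothetical values chosen for illustration. Only the IoU reward and the refine-until-converged control flow come from the abstract.

```python
import numpy as np

def iou(pred: np.ndarray, target: np.ndarray) -> float:
    """Intersection-over-Union between two boolean occupancy masks."""
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return float(inter) / float(union) if union else 1.0

def self_refine(propose, refine, target, max_iters=5, threshold=0.9):
    """Training-free verifier-refiner loop: propose a placement, score it
    with IoU (the geometric-consistency reward), and feed the score back
    to the refiner until the reward clears the threshold or the budget
    runs out. No parameters are updated; only the context changes."""
    best = propose()
    best_score = iou(best, target)
    for _ in range(max_iters):
        if best_score >= threshold:
            break
        candidate = refine(best, best_score)  # refiner sees the feedback score
        score = iou(candidate, target)
        if score > best_score:                # keep only improving revisions
            best, best_score = candidate, score
    return best, best_score
```

In the actual framework the refiner is the VLM itself, prompted in context with the previous prediction and its reward; here any callable with the same signature can be plugged in, which makes the loop easy to unit-test with synthetic masks.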