Geometric reasoning remains a core challenge for Multimodal Large Language Models (MLLMs). Even the most advanced closed-source systems, such as GPT-O3 and Gemini-2.5-Pro, still struggle to solve geometry problems reliably, despite exhibiting strong textual reasoning abilities on tasks like the International Mathematical Olympiad (IMO). This gap suggests that the bottleneck lies in understanding geometric diagrams rather than reasoning itself. Since geometric figures can often be faithfully described in concise textual form, converting visual content into captions offers a promising direction. Motivated by this insight, we introduce CapGeo, a caption-assisted reasoning framework that bridges visual and textual modalities. Experiments show substantial improvements when models are equipped with captions: Qwen2.5-VL-72B improves from 8.6% (vision-only) to 59.0%, while Claude-Opus-4 rises from 44.8% to 73.0%. To systematically evaluate and identify high-quality geometric captioning models, we further propose CapGeo-Bench, a dataset of 4,641 curated figure-caption pairs. Crucially, CapGeo-Bench incorporates a keypoint-based evaluation metric that correlates strongly with downstream CapGeo performance, enabling reliable assessment of geometric captioning ability. Together, our framework and benchmark highlight a new pathway toward advancing geometric reasoning in MLLMs.