Geometric reasoning remains a core challenge for Multimodal Large Language Models (MLLMs). Even the most advanced closed-source systems, such as GPT-O3 and Gemini-2.5-Pro, still struggle to solve geometry problems reliably, despite exhibiting strong textual reasoning abilities on tasks like the International Mathematical Olympiad (IMO). This gap suggests that the bottleneck lies in understanding geometric diagrams rather than reasoning itself. Since geometric figures can often be faithfully described in concise textual form, converting visual content into captions offers a promising direction. Motivated by this insight, we introduce CapGeo, a caption-assisted reasoning framework that bridges visual and textual modalities. Experiments show substantial improvements when models are equipped with captions: Qwen2.5-VL-72B improves from 8.6% (vision-only) to 59.0%, while Claude-Opus-4 rises from 44.8% to 73.0%. To systematically evaluate and identify high-quality geometric captioning models, we further propose CapGeo-Bench, a dataset of 4,641 curated figure-caption pairs. Crucially, CapGeo-Bench incorporates a keypoint-based evaluation metric that correlates strongly with downstream CapGeo performance, enabling reliable assessment of geometric captioning ability. Together, our framework and benchmark highlight a new pathway toward advancing geometric reasoning in MLLMs.