EAGLE: Elevating Geometric Reasoning through LLM-empowered Visual Instruction Tuning

Multi-modal Large Language Models (MLLMs) have advanced rapidly on general tasks, yet they still struggle with geometric reasoning, which requires tight integration of visual recognition and complex reasoning. Existing MLLMs mainly optimize the LLM backbone to strengthen problem solving and rarely improve how visual elements are discerned. We show that MLLMs suffer from severe visual perception deficiencies, including inaccurate geometric comprehension and visual hallucinations, and that these deficiencies constrain reasoning performance. We therefore revisit geometric reasoning through a visual-centric lens and propose EAGLE, a novel coarse-to-fine visual enhancement framework that progressively uses LLM guidance to improve perception. Because geometric diagrams differ substantially from natural images, we first introduce Geometric Knowledge Injection, which learns fundamental knowledge from diagram-caption data to enhance recognition and improve geometry-language alignment. Because different elements contribute unequally to the reasoning process, we then introduce Geometric Knowledge Refinement, which uses LLM-driven chain-of-thought solutions to guide the vision encoder in adaptively prioritizing key elements, fostering a synergistic interplay between visual comprehension and mathematical reasoning. The resulting model, EAGLE, is a geometry expert with strong perception and reasoning capabilities. Extensive experiments on three popular benchmarks demonstrate its effectiveness.
