Despite their proficiency in general tasks, Multi-modal Large Language Models (MLLMs) struggle with automatic Geometry Problem Solving (GPS), which demands understanding diagrams, interpreting symbols, and performing complex reasoning. This limitation arises from their pre-training on natural images and text, along with the lack of automated verification in the problem-solving process. Moreover, current geometric specialists are limited by their task-specific designs, making them less effective for broader geometric problems. To this end, we present GeoX, a multi-modal large model focusing on geometric understanding and reasoning tasks. Given the significant differences between geometric diagrams/symbols and natural images/text, we introduce unimodal pre-training to develop a diagram encoder and a symbol decoder, enhancing the understanding of geometric images and corpora. Furthermore, we introduce geometry-language alignment, an effective pre-training paradigm that bridges the modality gap between the unimodal geometric experts. We propose a Generator-And-Sampler Transformer (GS-Former) to generate discriminative queries and eliminate uninformative representations from unevenly distributed geometric signals. Finally, GeoX benefits from visual instruction tuning, empowering it to take geometric images and questions as input and generate verifiable solutions. Experiments show that GeoX outperforms both generalists and geometric specialists on widely recognized benchmarks, including GeoQA, UniGeo, Geometry3K, and PGPS9k.
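The abstract does not spell out GS-Former's internals, but the stated idea, learnable queries that attend over diagram features, followed by a sampler that discards uninformative tokens, can be illustrated with a toy sketch. The following is a minimal pure-Python illustration, assuming single-head dot-product cross-attention and top-k token selection by total attention mass; all function names and details here are hypothetical, not the paper's implementation:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gs_former_sketch(features, queries, keep_ratio=0.5):
    """Toy 'generator-and-sampler' step.

    Generator: each learnable query cross-attends over all diagram
    feature tokens, producing one aggregated vector per query.
    Sampler: each feature token is scored by the total attention mass
    it receives across queries; only the top-k tokens are kept,
    dropping uninformative representations.
    """
    d = len(features[0])
    scale = 1.0 / math.sqrt(d)
    # Attention weights: one softmax row per query over all tokens.
    attn = [softmax([dot(q, f) * scale for f in features]) for q in queries]
    # Query outputs: attention-weighted sums of feature tokens.
    query_out = [
        [sum(w * f[i] for w, f in zip(row, features)) for i in range(d)]
        for row in attn
    ]
    # Token informativeness: total attention received across all queries.
    token_scores = [sum(row[j] for row in attn) for j in range(len(features))]
    k = max(1, int(len(features) * keep_ratio))
    kept = sorted(range(len(features)), key=lambda j: -token_scores[j])[:k]
    return query_out, sorted(kept)
```

In a real model the queries would be trained parameters and the attention multi-head, but the sketch shows the core contract: a fixed-size set of query outputs plus a pruned subset of input tokens.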