Large language models (LLMs) have demonstrated significant capabilities in mathematical reasoning, particularly with text-based mathematical problems. However, current multi-modal large language models (MLLMs), especially those specialized in mathematics, tend to focus predominantly on solving geometric problems while ignoring the diversity of visual information available in other areas of mathematics. Moreover, the geometric information for these specialized mathematical MLLMs is derived from several public datasets, which are typically limited in diversity and complexity. To address these limitations, we construct a fine-tuning dataset named MathVL and develop a series of specialized mathematical MLLMs, termed MathGLM-Vision, by conducting Supervised Fine-Tuning (SFT) on MathVL with backbones of various parameter scales. To extensively evaluate the effectiveness of MathGLM-Vision, we conduct experiments on several public benchmarks and on our curated MathVL-test, which consists of 2,000 problems. Experimental results demonstrate that MathGLM-Vision achieves significant improvements over existing models, including its backbone models and open-source mathematical MLLMs. These findings indicate the importance of dataset diversity in enhancing the mathematical reasoning abilities of MLLMs.