Large language models (LLMs) have demonstrated impressive reasoning capabilities, particularly in textual mathematical problem solving. However, existing open-source image instruction fine-tuning datasets, which contain only a limited number of question-answer pairs per image, do not fully exploit visual information to enhance the multimodal mathematical reasoning capabilities of Multimodal LLMs (MLLMs). To bridge this gap, we address the scarcity of high-quality, diverse multimodal mathematical data by collecting 40K high-quality images with question-answer pairs from 24 existing datasets and synthesizing 320K new pairs, creating the MathV360K dataset, which enhances both the breadth and depth of multimodal mathematical questions. We introduce Math-LLaVA, a LLaVA-1.5-based model fine-tuned on MathV360K. This approach significantly improves the multimodal mathematical reasoning capability of LLaVA-1.5, achieving a 19-point improvement over the base model and performance comparable to GPT-4V on MathVista's minitest split. Furthermore, Math-LLaVA demonstrates strong generalizability, showing substantial gains on the MMMU benchmark. Our research highlights the importance of dataset diversity and synthesis in advancing the mathematical reasoning abilities of MLLMs. The code and data are available at: \url{https://github.com/HZQ950419/Math-LLaVA}.