Large language models (LLMs) have achieved promising results in mathematical reasoning, a foundational skill of human intelligence. Most previous studies focus on improving and measuring LLM performance on textual math reasoning datasets (e.g., MATH, GSM8K). Recently, researchers have released English multimodal math datasets (e.g., MATHVISTA and MATH-V) to evaluate the effectiveness of large multimodal models (LMMs). In this paper, we release a Chinese multimodal math (CMM-Math) dataset, comprising benchmark and training splits, to evaluate and enhance the mathematical reasoning of LMMs. CMM-Math contains over 28,000 high-quality samples with detailed solutions, covering a variety of problem types (e.g., multiple-choice and fill-in-the-blank) across 12 grade levels from elementary to high school in China. Notably, visual context may appear in either the questions or the answer options, which makes the dataset more challenging. Through comprehensive analysis, we find that state-of-the-art LMMs struggle on the CMM-Math dataset, underscoring the need for further improvements in LMM development. We also propose a multimodal mathematical LMM (Math-LMM) that handles problems with mixed input of multiple images and text segments. We train our model in three stages: foundational pre-training, foundational fine-tuning, and mathematical fine-tuning. Extensive experiments on three multimodal mathematical datasets show that our model effectively improves math reasoning performance compared with SOTA LMMs.