To thoroughly assess the mathematical reasoning abilities of Large Language Models (LLMs), we need carefully curated evaluation datasets that cover diverse mathematical concepts and problems at different difficulty levels. To this end, we propose FineMath, a fine-grained mathematical evaluation benchmark dataset for assessing Chinese LLMs. FineMath covers the key mathematical concepts taught in elementary school math, which are further divided into 17 categories of math word problems, enabling in-depth analysis of the mathematical reasoning abilities of LLMs. All problems in the 17 categories are manually annotated with difficulty levels according to the number of reasoning steps required to solve them. We conduct extensive experiments with a wide range of LLMs on FineMath and find that there is still considerable room for improvement in the mathematical reasoning capability of Chinese LLMs. We also carry out an in-depth analysis of the evaluation process and the evaluation methods, two factors that have previously been overlooked. Both significantly influence model results and our understanding of the models' mathematical reasoning capabilities. The dataset will be publicly available soon.