Recent advances in Large Language Models (LLMs) have produced striking results on existing logical reasoning benchmarks, with some models even surpassing human performance. However, the true depth and robustness of their competence in mathematical reasoning tasks remain an open question. In response, we develop (i) an ontology of perturbations of maths questions, (ii) a semi-automatic method of perturbation, and (iii) a dataset of perturbed maths questions to probe the limits of LLM capabilities in mathematical reasoning tasks. These controlled perturbations span multiple fine-grained dimensions of the structural and representational aspects of maths questions. Using GPT-4, we generated the MORE dataset by perturbing five randomly selected seed questions from GSM8K. This process was guided by our ontology and involved thorough automatic and manual filtering, yielding a set of 216 maths problems. We conducted a comprehensive evaluation of both closed-source and open-source LLMs on MORE. The results show a significant performance drop across all models on the perturbed questions. This strongly suggests that current LLMs lack robust mathematical skills and deep reasoning abilities. This research not only identifies multiple gaps in the capabilities of current models, but also highlights multiple potential directions for future development. Our dataset will be made publicly available at https://huggingface.co/datasets/declare-lab/GSM8k_MORE.
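The performance drop described above can be made concrete as the difference between a model's accuracy on the seed questions and its accuracy on each group of perturbed questions. The following is a minimal sketch of that bookkeeping, not the evaluation code used for MORE: the record keys `category` and `correct`, the label `original`, and the toy records are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return {category: accuracy} for graded answers.

    Each record is a dict with two hypothetical keys:
    'category' (str) and 'correct' (bool).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        totals[record["category"]] += 1
        hits[record["category"]] += int(record["correct"])
    return {category: hits[category] / totals[category] for category in totals}


def accuracy_drop(records, baseline="original"):
    """Return {category: baseline accuracy minus category accuracy}."""
    accuracy = accuracy_by_category(records)
    base = accuracy[baseline]
    return {
        category: base - value
        for category, value in accuracy.items()
        if category != baseline
    }


# Toy graded answers, made up for illustration (not results from the paper).
toy_records = [
    {"category": "original", "correct": True},
    {"category": "original", "correct": True},
    {"category": "perturbed_a", "correct": True},
    {"category": "perturbed_a", "correct": False},
]

drops = accuracy_drop(toy_records)
```

A positive value in `drops` means the model answered that perturbation group less accurately than the unperturbed seed questions.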