The evaluation of mathematical reasoning capabilities is essential for advancing Artificial General Intelligence (AGI). While Large Language Models (LLMs) have shown impressive performance in solving mathematical problems, existing benchmarks such as GSM8K and MATH present limitations, including narrow problem definitions with specific numbers and reliance on predetermined rules, which hinder accurate assessment of reasoning and adaptability. This paper introduces the UTMath Benchmark, which robustly evaluates models through extensive unit tests. It consists of 1,053 problems across 9 mathematical domains, with over 68 test cases per problem. We propose an innovative evaluation framework inspired by unit testing in software development, focusing on both the accuracy and the reliability of results. We further introduce the Reasoning-to-Coding of Thoughts (RCoT) approach, which encourages LLMs to perform explicit reasoning before generating code, leading to more advanced solutions and improved performance. Finally, we release not only the UTMath benchmark but also the UTMath-Train training dataset (more than 70k samples) to support the community in further exploring mathematical reasoning.