In this paper, we introduce AceMath, a suite of frontier math models that excel at solving complex math problems, along with highly effective reward models capable of evaluating generated solutions and reliably identifying the correct ones. To develop the instruction-tuned math models, we propose a supervised fine-tuning (SFT) process that first achieves competitive performance across general domains, followed by targeted fine-tuning for the math domain using a carefully curated set of prompts and synthetically generated responses. The resulting model, AceMath-72B-Instruct, greatly outperforms Qwen2.5-Math-72B-Instruct, GPT-4o, and Claude-3.5 Sonnet. To develop a math-specialized reward model, we first construct AceMath-RewardBench, a comprehensive and robust benchmark for evaluating math reward models across diverse problems and difficulty levels. We then present a systematic approach for building our math reward models. The resulting model, AceMath-72B-RM, consistently outperforms state-of-the-art reward models. Furthermore, when combining AceMath-72B-Instruct with AceMath-72B-RM, we achieve the highest average rm@8 score across the math reasoning benchmarks. We will release model weights, training data, and evaluation benchmarks at: https://research.nvidia.com/labs/adlr/acemath
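The rm@8 metric above refers to reward-model best-of-n selection: sample n = 8 candidate solutions per problem, score each with the reward model, and keep the top-scoring one. The sketch below illustrates this selection step only; the candidate strings and the scoring function are hypothetical stand-ins, not the actual AceMath models or API.

```python
# Minimal sketch of rm@k selection (here k = 8): given k sampled candidate
# solutions, score each with a reward model and return the highest-scoring
# one. `toy_score` is a hypothetical stand-in for a real reward model.

def rm_select(candidates, score_fn):
    """Return the candidate with the highest reward-model score."""
    return max(candidates, key=score_fn)

# Toy illustration: 8 sampled candidates with stand-in reward scores.
candidates = [f"solution-{i}" for i in range(8)]
toy_score = lambda s: int(s.split("-")[1])  # hypothetical scorer
best = rm_select(candidates, toy_score)
print(best)  # "solution-7"
```

In practice the scorer would be a forward pass of the reward model over (problem, candidate) pairs, and rm@8 accuracy is the fraction of problems for which the selected candidate is correct.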