In math reasoning with large language models (LLMs), augmenting fine-tuning data through query evolution and diverse reasoning paths has been shown empirically to be effective, substantially narrowing the gap between open-source LLMs and cutting-edge proprietary LLMs. In this paper, we investigate such data augmentation in math reasoning and aim to answer three questions: (1) Which data augmentation strategies are more effective? (2) What is the scaling relationship between the amount of augmented data and model performance? (3) Can data augmentation incentivize generalization to out-of-domain mathematical reasoning tasks? To this end, we create a new dataset, AugGSM8K, by complicating and diversifying the queries from GSM8K and sampling multiple reasoning paths. We obtain a series of LLMs called MuggleMath by fine-tuning on subsets of AugGSM8K. MuggleMath achieves a new state of the art on GSM8K (from 54% to 68.4% at the 7B scale, and from 63.9% to 74.0% at the 13B scale). We observe a log-linear relationship between MuggleMath's performance and the amount of augmented data. We also find that MuggleMath generalizes weakly to out-of-domain math reasoning on MATH. We attribute this to the difference in query distribution between AugGSM8K and MATH, which suggests that augmentation on a single benchmark does not help overall math reasoning performance. Code and AugGSM8K will be uploaded to https://github.com/OFA-Sys/gsm8k-ScRel.
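To make the log-linear scaling claim concrete, the following is a minimal sketch of how a relationship of the form accuracy ≈ a + b · ln(N) could be fitted, where N is the number of augmented training samples. The function name `fit_log_linear`, the coefficients, and the data points are all hypothetical and purely illustrative; they are not results from this work, and the fitting procedure is an assumption rather than the method used here.

```python
import math


def fit_log_linear(sizes, accuracies):
    """Ordinary least-squares fit of accuracy = a + b * ln(size).

    Hypothetical helper for illustration only.
    """
    xs = [math.log(n) for n in sizes]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(accuracies) / count
    # Slope is covariance(x, y) divided by variance(x).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Synthetic points generated exactly from a = 10, b = 5.
# These are made-up values, NOT measurements from the paper.
sizes = [1_000, 10_000, 100_000]
accuracies = [10 + 5 * math.log(n) for n in sizes]

a, b = fit_log_linear(sizes, accuracies)

# Under a log-linear law, doubling the data adds a constant b * ln(2)
# to the predicted accuracy, regardless of the starting size.
gain_per_doubling = b * math.log(2)
```

Under such a fit, each doubling of the augmented data yields the same additive gain, which is one way to read a log-linear relationship: returns diminish per sample but stay constant per doubling.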