Mathematical reasoning is a challenging task for large language models (LLMs), yet how it scales with LLM capacity remains under-explored. In this paper, we investigate how pre-training loss, the amount of supervised data, and the amount of augmented data influence the reasoning performance of a supervised LLM. We find that pre-training loss is a better indicator of a model's performance than its parameter count. We apply supervised fine-tuning (SFT) with different amounts of supervised data and empirically find a log-linear relation between data amount and model performance; we also find that better models improve less as the supervised dataset grows. To obtain more training samples without any human effort, we propose Rejection sampling Fine-Tuning (RFT). RFT uses supervised models to generate reasoning paths and collects the correct ones as an augmented fine-tuning dataset. We find that RFT improves mathematical reasoning performance more when the augmented samples contain more distinct reasoning paths, and that it brings larger improvements to less performant LLMs. Furthermore, combining rejection samples from multiple models pushes LLaMA-7B to an accuracy of 49.3% on GSM8K, significantly outperforming the SFT accuracy of 35.9%.
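The RFT data-collection step described above can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not state: `sample_fn` is a hypothetical callable standing in for sampling one reasoning path from a supervised model; a path is taken to end with a `#### <answer>` marker; a path counts as correct when that final answer matches the reference exactly; and two paths count as distinct when their whitespace-normalized text differs. The function names `extract_answer`, `rejection_sample`, and `build_rft_dataset` are likewise illustrative.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical stand-in for "sample one reasoning path from a supervised model".
SampleFn = Callable[[str], str]


def extract_answer(path: str) -> Optional[str]:
    """Return the text after the last '####' marker, or None if there is none.

    Assumption: reasoning paths end with '#### <final answer>'.
    """
    if "####" not in path:
        return None
    return path.rsplit("####", 1)[1].strip()


def normalize(path: str) -> str:
    """Assumed distinctness key: the path with runs of whitespace collapsed."""
    return " ".join(path.split())


def rejection_sample(question: str, gold_answer: str,
                     sample_fn: SampleFn, k: int) -> List[str]:
    """Draw k paths and keep the correct, mutually distinct ones."""
    kept: Dict[str, str] = {}
    for _ in range(k):
        path = sample_fn(question)
        if extract_answer(path) != gold_answer:
            continue  # reject paths whose final answer is wrong or missing
        kept.setdefault(normalize(path), path)  # first occurrence wins
    return list(kept.values())


def build_rft_dataset(train_set: List[Dict[str, str]],
                      sample_fns: List[SampleFn], k: int) -> List[Dict[str, str]]:
    """Pool accepted paths from one or more models into a fine-tuning set."""
    dataset: List[Dict[str, str]] = []
    for example in train_set:
        pooled: Dict[str, str] = {}
        for sample_fn in sample_fns:
            accepted = rejection_sample(
                example["question"], example["answer"], sample_fn, k
            )
            for path in accepted:
                pooled.setdefault(normalize(path), path)  # dedupe across models
        dataset.extend(
            {"question": example["question"], "path": p} for p in pooled.values()
        )
    return dataset
```

Passing several sampling functions to `build_rft_dataset` mirrors the idea of combining rejection samples from multiple models: each model may contribute correct paths that the others did not produce, so the pooled set tends to contain more distinct paths per question. The resulting question and path pairs would then be used for ordinary supervised fine-tuning.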