One way to enhance the reasoning capability of Large Language Models (LLMs) is Supervised Fine-Tuning (SFT) on Chain-of-Thought (CoT) annotations. This approach does not generalize sufficiently well, however, because training relies only on the given CoT data. In math problem-solving, for example, the training data usually contains only one annotated reasoning path per question. Intuitively, the algorithm would learn better from multiple annotated reasoning paths for each question. To address this issue, we propose a simple yet effective approach called Reinforced Fine-Tuning (ReFT) that improves the generalizability of LLMs trained for reasoning, using math problem-solving as an example. ReFT first warms up the model with SFT and then applies online reinforcement learning, specifically the PPO algorithm in this paper, to fine-tune it further: many reasoning paths are automatically sampled for each question, and the rewards are derived naturally from the ground-truth answers. Extensive experiments on the GSM8K, MathQA, and SVAMP datasets show that ReFT significantly outperforms SFT, and that performance can potentially be boosted further by combining it with inference-time strategies such as majority voting and re-ranking. Note that ReFT obtains this improvement by learning from the same training questions as SFT, without relying on extra or augmented training questions. This indicates that ReFT has superior generalization ability.
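As an illustration of how a reward can be derived from a ground-truth answer and how majority voting can be applied at inference time, the following is a minimal Python sketch. It is not the implementation used in this paper. The function names (`extract_answer`, `answer_reward`, `majority_vote`), the binary 1.0/0.0 reward, and the convention that the final answer is the last number in a sampled reasoning path are all assumptions made for the example.

```python
import re
from collections import Counter
from typing import List, Optional

# Assumption: the final answer is the last number in the generated text.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_answer(reasoning: str) -> Optional[float]:
    """Return the last number found in a reasoning path, or None."""
    matches = _NUMBER.findall(reasoning.replace(",", ""))
    return float(matches[-1]) if matches else None


def answer_reward(reasoning: str, gold: float, tol: float = 1e-6) -> float:
    """Hypothetical binary reward: 1.0 if the extracted answer matches
    the ground-truth answer, 0.0 otherwise (including no answer found)."""
    predicted = extract_answer(reasoning)
    if predicted is None:
        return 0.0
    return 1.0 if abs(predicted - gold) <= tol else 0.0


def majority_vote(samples: List[str]) -> Optional[float]:
    """Return the most frequent extracted answer among sampled paths."""
    answers = [a for a in map(extract_answer, samples) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    sampled_paths = [
        "3 + 4 = 7, so the answer is 7",
        "3 * 4 = 12. Answer: 12",
        "Adding the two numbers gives 7",
    ]
    gold_answer = 7.0
    print([answer_reward(p, gold_answer) for p in sampled_paths])
    print(majority_vote(sampled_paths))
```

In an online reinforcement learning loop, a policy-gradient method such as PPO would consume per-sample rewards of this kind. The sketch leaves out the optimization step and shows only the reward and voting logic.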