One way to enhance the reasoning capability of Large Language Models (LLMs) is to conduct Supervised Fine-Tuning (SFT) using Chain-of-Thought (CoT) annotations. However, this approach shows limited generalization because the training relies only on the given CoT data. In math problem-solving, for example, the training data usually contains only one annotated reasoning path per question. Intuitively, it would be better for the algorithm to learn from multiple annotated reasoning paths for a given question. To address this issue, we propose a simple yet effective approach called Reinforced Fine-Tuning (ReFT) to enhance the generalizability of LLMs learned for reasoning, with math problem-solving as an example. ReFT first warms up the model with SFT, and then employs online reinforcement learning, specifically the PPO algorithm in this paper, to further fine-tune the model: an abundance of reasoning paths is automatically sampled for each question, and the rewards are naturally derived from the ground-truth answers. Extensive experiments on the GSM8K, MathQA, and SVAMP datasets show that ReFT significantly outperforms SFT, and its performance can be further boosted by combining inference-time strategies such as majority voting and re-ranking. Notably, ReFT obtains this improvement by learning from the same training questions as SFT, without relying on extra or augmented training questions, which indicates a superior generalization ability for ReFT.
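The reward described above, derived only from the ground-truth answer, can be illustrated with a minimal sketch. This is an assumption-laden toy illustration, not the paper's implementation: the answer-extraction pattern (`The answer is ...`) and the binary 1.0/0.0 reward values are hypothetical placeholders standing in for whatever format and reward shaping the actual training pipeline uses.

```python
import re
from typing import Optional


def extract_answer(reasoning_path: str) -> Optional[str]:
    """Pull the final numeric answer from a sampled CoT reasoning path.

    Hypothetical convention: the path ends with a line like 'The answer is 7'.
    """
    match = re.search(r"The answer is\s*(-?\d+(?:\.\d+)?)", reasoning_path)
    return match.group(1) if match else None


def terminal_reward(reasoning_path: str, ground_truth: str) -> float:
    """Terminal reward for the RL step, computed only from the ground-truth
    answer: 1.0 if the sampled path's final answer matches, 0.0 otherwise."""
    predicted = extract_answer(reasoning_path)
    if predicted is None:
        return 0.0  # unparseable path gets no reward
    return 1.0 if float(predicted) == float(ground_truth) else 0.0


# Multiple reasoning paths sampled for the same training question,
# each scored automatically without any extra annotation:
paths = [
    "3 + 4 = 7. The answer is 7",   # correct path
    "3 + 4 = 8. The answer is 8",   # incorrect path
]
rewards = [terminal_reward(p, "7") for p in paths]
print(rewards)  # [1.0, 0.0]
```

This is the key contrast with SFT: instead of imitating one fixed annotated path, the policy receives a learning signal for every path it samples, using nothing beyond the ground-truth answers already present in the training data.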