One way to enhance the reasoning capability of Large Language Models (LLMs) is to conduct Supervised Fine-Tuning (SFT) using Chain-of-Thought (CoT) annotations. However, this approach shows limited generalization ability because training relies solely on the given CoT data. In math problem-solving, for example, there is usually only one annotated reasoning path for each question in the training data. Intuitively, it would be better for the algorithm to learn from multiple annotated reasoning paths for a given question. To address this issue, we propose a simple yet effective approach called Reinforced Fine-Tuning (ReFT) to enhance the generalizability of learning LLMs for reasoning, with math problem-solving as an example. ReFT first warms up the model with SFT, and then employs online reinforcement learning, specifically the PPO algorithm in this paper, to further fine-tune the model, in which abundant reasoning paths are automatically sampled given the question and the rewards are naturally derived from the ground-truth answers. Extensive experiments on the GSM8K, MathQA, and SVAMP datasets show that ReFT significantly outperforms SFT, and the performance can potentially be further boosted by combining inference-time strategies such as majority voting and re-ranking. Note that ReFT obtains the improvement by learning from the same training questions as SFT, without relying on extra or augmented training questions. This indicates a superior generalization ability for ReFT.
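The reward signal described above, derived directly from ground-truth answers rather than from a learned reward model, can be illustrated with a minimal sketch. This assumes a binary correctness reward and a simple "The answer is ..." extraction convention; the paper's exact answer format and reward values are not specified here, so both are illustrative assumptions.

```python
import re

def extract_answer(reasoning_path: str):
    """Pull the final numeric answer out of a sampled reasoning path.

    The "answer is <number>" convention is a hypothetical format for
    illustration, not necessarily the paper's exact one.
    """
    match = re.search(r"answer is\s*(-?\d+(?:\.\d+)?)", reasoning_path, re.IGNORECASE)
    return match.group(1) if match else None

def reward(reasoning_path: str, ground_truth: str) -> float:
    """Score a sampled reasoning path against the ground-truth answer.

    Binary reward (assumed values): 1.0 if the extracted final answer
    matches the ground truth, 0.0 if it is wrong or unparseable.
    """
    predicted = extract_answer(reasoning_path)
    if predicted is None:
        return 0.0  # no parseable final answer in the sampled path
    return 1.0 if float(predicted) == float(ground_truth) else 0.0
```

Because the reward is computed mechanically from the dataset's answer labels, many reasoning paths can be sampled and scored per question without any extra annotation, which is what lets PPO explore beyond the single annotated CoT.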