Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Current methods for training self-correction typically depend on either multiple models, a more advanced model, or additional forms of supervision. To address these shortcomings, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are often insufficient for instilling self-correction behavior. In particular, we observe that training via SFT falls prey to either a distribution mismatch between mistakes made by the data-collection policy and the model's own responses, or to behavior collapse, where learning implicitly prefers only a certain mode of correction behavior that is often not effective at self-correction on test problems. SCoRe addresses these challenges by training under the model's own distribution of self-generated correction traces and using appropriate regularization to steer learning toward a self-correction strategy that is effective at test time, rather than simply fitting high-reward responses for a given prompt. This regularization process includes an initial phase of multi-turn RL on a base model to generate a policy initialization that is less susceptible to collapse, followed by using a reward bonus to amplify self-correction during training. With Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% on MATH and 9.1% on HumanEval.
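The reward-bonus idea in the second training phase can be made concrete with a minimal sketch: score the revised (second-attempt) response, and add a bonus proportional to the improvement over the first attempt, so the policy is rewarded for *correcting* rather than merely for getting the final answer right. The helper `is_correct` and the multiplier `alpha` below are illustrative assumptions, not the paper's exact implementation.

```python
def is_correct(answer: str, reference: str) -> float:
    """Toy correctness check (assumption): 1.0 if the final answer
    matches the reference exactly, else 0.0. A real setup would use
    a task-specific verifier (e.g., unit tests for HumanEval)."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def shaped_reward(first_attempt: str, second_attempt: str,
                  reference: str, alpha: float = 10.0) -> float:
    """Reward for a two-turn correction trace: correctness of the
    revised answer, plus a bonus proportional to the improvement
    from the first attempt to the second. The bonus is negative
    when the revision breaks an initially correct answer, which
    discourages collapsing into 'never change the answer'."""
    r1 = is_correct(first_attempt, reference)
    r2 = is_correct(second_attempt, reference)
    return r2 + alpha * (r2 - r1)
```

Under this shaping, a trace that flips a wrong first attempt into a correct revision earns far more than one that was correct from the start, while a revision that damages a correct answer is penalized.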