Large language models have astounded the world with fascinating new capabilities. However, they currently lack the ability to teach themselves new skills and instead rely on training with large amounts of human-generated data. We introduce SECToR (Self-Education via Chain-of-Thought Reasoning), a proof-of-concept demonstration that language models can teach themselves new skills using chain-of-thought reasoning. Inspired by previous work in both reinforcement learning (Silver et al., 2017) and human cognition (Kahneman, 2011), SECToR first uses chain-of-thought reasoning to slowly think its way through problems. It then fine-tunes the model to generate those same answers directly, without chain-of-thought reasoning. Language models trained via SECToR autonomously learn to add numbers of up to 29 digits without access to any ground-truth examples beyond an initial supervised fine-tuning phase consisting only of numbers with 6 or fewer digits. Our central hypothesis is that chain-of-thought reasoning can act as a policy improvement operator, analogous to how Monte-Carlo Tree Search is used in AlphaZero. We hope that this research can lead to new directions in which language models learn to teach themselves without the need for human demonstrations.
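The two-phase loop described in the abstract (solve slowly with chain-of-thought, then train the model to answer directly) can be illustrated with a toy sketch. This is not the SECToR implementation: `ToyModel`, `fast_add`, `slow_add`, `sector_round`, and `train` are hypothetical names, the "model" is an exact adder limited by operand length rather than a neural network, the digit-splitting decomposition is an assumed stand-in for the chain-of-thought steps, and "fine-tuning" is simulated by raising the length limit. Only the 6-digit starting point and the 29-digit figure come from the abstract; because the toy never makes mistakes, 29 is imposed here as a stopping point rather than discovered.

```python
import random


class ToyModel:
    """Hypothetical stand-in for a language model: answers directly only up to max_digits."""

    def __init__(self, max_digits: int) -> None:
        self.max_digits = max_digits

    def fast_add(self, a: int, b: int) -> int:
        # Direct answer with no intermediate reasoning.
        if max(len(str(a)), len(str(b))) > self.max_digits:
            raise ValueError("operands are longer than anything the model has learned")
        return a + b  # the toy is exact within its learned range


def slow_add(model: ToyModel, a: int, b: int) -> int:
    """Assumed chain-of-thought: split off the lowest digit and reuse the fast skill."""
    a_hi, a_lo = divmod(a, 10)
    b_hi, b_lo = divmod(b, 10)
    lo_sum = model.fast_add(a_lo, b_lo)   # single-digit sub-problem
    carry, lo_digit = divmod(lo_sum, 10)
    hi_sum = model.fast_add(a_hi, b_hi)   # sub-problem that is one digit shorter
    # Folding in the carry and re-attaching the low digit is scratchpad bookkeeping.
    return (hi_sum + carry) * 10 + lo_digit


def sector_round(model: ToyModel, n_samples: int, rng: random.Random) -> list:
    """One self-teaching round on problems one digit longer than the model can answer directly."""
    n = model.max_digits + 1
    dataset = []
    for _ in range(n_samples):
        a = rng.randrange(10 ** (n - 1), 10 ** n)
        b = rng.randrange(10 ** (n - 1), 10 ** n)
        # Labels come from the model's own slow reasoning, never from ground truth.
        dataset.append((a, b, slow_add(model, a, b)))
    # Simulated fine-tuning: the model can now answer n-digit problems directly.
    model.max_digits = n
    return dataset


def train(start_digits: int = 6, target_digits: int = 29,
          n_samples: int = 20, seed: int = 0) -> ToyModel:
    rng = random.Random(seed)
    model = ToyModel(max_digits=start_digits)
    while model.max_digits < target_digits:
        sector_round(model, n_samples, rng)
    return model
```

Calling `train()` grows the toy model's direct range from 6 to 29 digits one digit per round. The point of the sketch is the shape of the loop: each round's training targets are produced by the slower procedure built on the current fast skill, which is the sense in which chain-of-thought plays the role of a policy improvement operator.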