The self-improvement ability of large language models (LLMs), enabled by prompting them to analyze and revise their own outputs, has garnered significant interest in recent research. However, this ability has been shown to be absent in, and difficult to learn for, smaller models, which widens the performance gap between state-of-the-art LLMs and cheaper, faster ones. To reduce this gap, we introduce TriPosT, a training algorithm that endows smaller models with this self-improvement ability, and show that our approach can improve LLaMA-7b's performance on math and reasoning tasks by up to 7.13%. In contrast to prior work, we achieve this by having the smaller model interact with LLMs to collect feedback and improvements on its own generations. We then replay this experience to train the small model. Our experiments on four math and reasoning datasets show that the interactive experience of learning from and correcting its own mistakes is crucial for small models to improve their performance.
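The collect-then-replay procedure described above can be sketched roughly as follows. This is a minimal illustration, not the TriPosT implementation: the abstract does not specify the data format, how incorrect attempts are detected, or how the collected data is filtered. The names `Experience`, `collect_experience`, `to_training_pairs`, and the callables `small_model`, `llm_feedback`, `llm_improve`, and `is_correct` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Experience:
    question: str
    attempt: str                # the small model's own generation
    feedback: Optional[str]     # LLM critique, or None if no fix was needed
    improvement: Optional[str]  # LLM-revised answer, or None


def collect_experience(
    questions: List[str],
    small_model: Callable[[str], str],             # hypothetical: question -> attempt
    llm_feedback: Callable[[str, str], str],       # hypothetical: (question, attempt) -> critique
    llm_improve: Callable[[str, str, str], str],   # hypothetical: (question, attempt, critique) -> fix
    is_correct: Callable[[str, str], bool],        # assumption: some correctness check exists
) -> List[Experience]:
    """Let the small model answer, then ask the LLM to critique and fix its mistakes."""
    buffer: List[Experience] = []
    for q in questions:
        attempt = small_model(q)
        if is_correct(q, attempt):
            # Correct attempts are kept as-is, with no feedback attached.
            buffer.append(Experience(q, attempt, None, None))
            continue
        critique = llm_feedback(q, attempt)
        fixed = llm_improve(q, attempt, critique)
        buffer.append(Experience(q, attempt, critique, fixed))
    return buffer


def to_training_pairs(buffer: List[Experience]) -> List[Tuple[str, str]]:
    """Turn collected experience into (input, target) pairs for supervised replay."""
    pairs: List[Tuple[str, str]] = []
    for e in buffer:
        if e.feedback is None:
            target = e.attempt
        else:
            # Assumed target layout: attempt, then critique, then the revision.
            target = f"{e.attempt}\nFeedback: {e.feedback}\nUpdated answer: {e.improvement}"
        pairs.append((e.question, target))
    return pairs
```

The resulting pairs would then be fed to an ordinary fine-tuning loop for the small model. The point of the sketch is that the training targets start from the small model's own attempts rather than from LLM-written demonstrations.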