The self-improving ability of large language models (LLMs), enabled by prompting them to analyze and revise their own outputs, has garnered significant interest in recent research. However, this ability has been shown to be absent from smaller models and difficult for them to learn, widening the performance gap between state-of-the-art LLMs and their more cost-effective, faster counterparts. To reduce this gap, we introduce TriPosT, a training algorithm that endows smaller models with such self-improvement ability, and show that our approach can improve LLaMA-7b's performance on math and reasoning tasks by up to 7.13%. In contrast to prior work, we achieve this by having the smaller model interact with LLMs to collect feedback and improvements on its own generations. We then replay this experience to train the small model. Our experiments on four math and reasoning datasets show that the interactive experience of learning from and correcting its own mistakes is crucial for small models to improve their performance.
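The interact-then-replay procedure described above can be illustrated with a minimal sketch. This is not the paper's implementation: every function name here is a hypothetical placeholder standing in for the small model's generation step, the larger LLM's feedback-and-improvement step, and the replay fine-tuning step.

```python
# Hedged sketch of a TriPosT-style loop: the small model attempts each
# problem, a larger LLM provides feedback and an improved answer, and the
# resulting trajectories are replayed as training data for the small model.
# All functions below are illustrative stubs, not the paper's API.

def small_model_generate(problem):
    # Placeholder for the small model's initial attempt.
    return f"attempt({problem})"

def llm_feedback_and_improve(problem, attempt):
    # Placeholder for the larger LLM critiquing the attempt and
    # returning (feedback, improved_answer).
    return f"feedback({attempt})", f"improved({attempt})"

def collect_trajectories(problems):
    """Interactive collection: pair each small-model attempt with the
    LLM's feedback and improved answer."""
    trajectories = []
    for p in problems:
        attempt = small_model_generate(p)
        feedback, improved = llm_feedback_and_improve(p, attempt)
        trajectories.append({"problem": p, "attempt": attempt,
                             "feedback": feedback, "improved": improved})
    return trajectories

def replay_train(trajectories):
    """Replay: turn each trajectory into a training example of the form
    (problem, attempt, feedback, improvement). The actual fine-tuning
    step is stubbed out here."""
    return [(t["problem"], t["attempt"], t["feedback"], t["improved"])
            for t in trajectories]
```

The key design point the abstract emphasizes is that the training data is grounded in the small model's *own* mistakes, rather than in demonstrations produced solely by the larger LLM.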