Pretrained language models often generate outputs that are not in line with human preferences, such as harmful text or factually incorrect summaries. Recent work addresses these issues by learning from a simple form of human feedback: comparisons between pairs of model-generated outputs. However, comparison feedback conveys only limited information about human preferences. In this paper, we introduce Imitation learning from Language Feedback (ILF), a new approach that utilizes more informative language feedback. ILF consists of three steps that are applied iteratively: first, conditioning the language model on the input, an initial LM output, and feedback to generate refinements; second, selecting the refinement that incorporates the most feedback; third, finetuning the language model to maximize the likelihood of the chosen refinement given the input. We show theoretically that ILF can be viewed as Bayesian inference, similar to reinforcement learning from human feedback. We evaluate ILF's effectiveness on a carefully controlled toy task and a realistic summarization task. Our experiments demonstrate that large language models accurately incorporate feedback and that finetuning with ILF scales well with dataset size, even outperforming finetuning on human summaries. Learning from both language and comparison feedback outperforms learning from either alone, achieving human-level summarization performance.
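The three ILF steps can be sketched as a single round of a loop. The following is a minimal illustration only, not the paper's implementation: the names `Example`, `ilf_round`, `toy_refine`, and `toy_score` are hypothetical, the language model is replaced by a string-manipulating stand-in, and "incorporates the most feedback" is approximated by a simple word-overlap count chosen purely for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# All names here are hypothetical placeholders for illustration.
Refiner = Callable[[str, str, str, int], List[str]]      # (input, initial output, feedback, n) -> candidates
Scorer = Callable[[str, str], float]                     # (refinement, feedback) -> incorporation score
Finetuner = Callable[[Sequence[Tuple[str, str]]], None]  # consumes (input, chosen refinement) pairs


@dataclass
class Example:
    source: str    # task input, e.g. a document to summarize
    initial: str   # initial LM output
    feedback: str  # language feedback on the initial output


def ilf_round(examples: Sequence[Example], refine: Refiner, score: Scorer,
              finetune: Finetuner, n_candidates: int = 4) -> List[Tuple[str, str]]:
    """Run one round of the three steps and return the training pairs."""
    pairs = []
    for ex in examples:
        # Step 1: generate refinements conditioned on input, initial output, and feedback.
        candidates = refine(ex.source, ex.initial, ex.feedback, n_candidates)
        # Step 2: keep the refinement that the scorer rates as incorporating the most feedback.
        best = max(candidates, key=lambda r: score(r, ex.feedback))
        pairs.append((ex.source, best))
    # Step 3: finetune on (input, chosen refinement) pairs.
    finetune(pairs)
    return pairs


def toy_refine(source: str, initial: str, feedback: str, n: int) -> List[str]:
    """Stand-in for LM sampling: append a growing prefix of the feedback words."""
    words = feedback.split()
    return [" ".join([initial] + words[:k]) for k in range(n)]


def toy_score(refinement: str, feedback: str) -> float:
    """Assumed scoring rule: count feedback words that appear in the refinement."""
    present = set(refinement.split())
    return float(sum(w in present for w in feedback.split()))
```

In this sketch, calling `ilf_round` with `toy_refine`, `toy_score`, and a `finetune` callable that simply records its input shows the data flow; a real setup would swap in an actual language model for refinement, a learned or model-based scorer, and a likelihood-maximizing training step, and would repeat the round to apply the steps iteratively.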