Chain of Hindsight Aligns Language Models with Feedback

Learning from human preferences is important for language models to be helpful and useful to humans, and to align with human and social values. Prior work has achieved remarkable success by learning from human feedback to understand and follow instructions. These approaches fall into two categories: supervised finetuning and RLHF. Supervised finetuning is based on curated model generations that are preferred by human labelers. Its key limitation is that it cannot learn from negative ratings; models are trained only on positive feedback, which makes it data-inefficient and difficult to generalize. While RLHF can learn from all feedback by learning a reward function and applying RL optimization, it suffers from an imperfect reward function, and RL is very hard to tune. In this work, we propose a novel technique that addresses the limitations of both supervised finetuning and RLHF. Our method, Chain of Hindsight, aligns language models with all feedback without using reinforcement learning. Our idea is motivated by how humans learn from hindsight experience: we turn all feedback into a sentence and use it to finetune the model, leveraging the language understanding abilities of language models. We condition the model on a sequence of model generations paired with hindsight feedback, and finetune the model to predict the most preferred output. By doing so, models can learn to identify and correct negative attributes or errors. Applying our method to GPT-J, we observe that it substantially outperforms both supervised finetuning and RLHF on summarization and dialogue tasks and is significantly more preferred in human evaluations.
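The data construction described in the abstract (model generations paired with feedback sentences, with the most preferred output as the prediction target) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the feedback phrases `BAD_PREFIX` and `GOOD_PREFIX`, the `RatedOutput` type, the helper names, and the whitespace "tokenizer" are all hypothetical stand-ins, since the abstract does not specify the feedback wording or the tokenization.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RatedOutput:
    text: str
    rating: float  # higher means more preferred by the human labeler


# Hypothetical feedback phrases; the abstract does not give the actual wording.
BAD_PREFIX = "A less preferred response is:"
GOOD_PREFIX = "A more preferred response is:"


def build_chain(prompt: str, outputs: List[RatedOutput]) -> List[Tuple[str, bool]]:
    """Return (segment, is_target) pairs ordered from least to most preferred.

    Only the most preferred output is marked as a prediction target;
    the prompt, the feedback sentences, and the worse outputs are context.
    """
    if len(outputs) < 2:
        raise ValueError("need at least two rated outputs to form a chain")
    ranked = sorted(outputs, key=lambda o: o.rating)
    segments: List[Tuple[str, bool]] = [(prompt, False)]
    for out in ranked[:-1]:
        segments.append((BAD_PREFIX, False))
        segments.append((out.text, False))
    segments.append((GOOD_PREFIX, False))
    segments.append((ranked[-1].text, True))
    return segments


def to_tokens_and_mask(segments: List[Tuple[str, bool]]) -> Tuple[List[str], List[int]]:
    """Flatten segments into tokens plus a 0/1 loss mask.

    Whitespace splitting stands in for a real subword tokenizer (assumption).
    """
    tokens: List[str] = []
    mask: List[int] = []
    for text, is_target in segments:
        words = text.split()
        tokens.extend(words)
        mask.extend([1 if is_target else 0] * len(words))
    return tokens, mask


if __name__ == "__main__":
    chain = build_chain(
        "Summarize: the cat sat.",
        [RatedOutput("A cat sat.", 0.9), RatedOutput("Dogs bark.", 0.1)],
    )
    tokens, mask = to_tokens_and_mask(chain)
    for tok, m in zip(tokens, mask):
        print(m, tok)
```

In an actual finetuning run, the mask would select which positions contribute to the next-token loss, so the model conditions on the worse generation and its feedback while being trained to produce the preferred one. No reward model or RL step appears anywhere in this pipeline.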

