Language models (LMs) are pretrained to imitate internet text, including content that would violate human preferences if generated by an LM: falsehoods, offensive comments, personally identifiable information, low-quality or buggy code, and more. Here, we explore alternative objectives for pretraining LMs in a way that also guides them to generate text aligned with human preferences. We benchmark five objectives for pretraining with human feedback across three tasks and study how they affect the trade-off between alignment and capabilities of pretrained LMs. Among the objectives we explored, we find one that is both simple and Pareto-optimal: conditional training, or learning a distribution over tokens conditional on their human preference scores given by a reward model. Conditional training reduces the rate of undesirable content by up to an order of magnitude, both when generating without a prompt and with an adversarially chosen prompt. Moreover, conditional training maintains the downstream task performance of standard LM pretraining, both before and after task-specific finetuning. Pretraining with human feedback results in much better preference satisfaction than standard LM pretraining followed by finetuning with feedback, i.e., learning and then unlearning undesirable behavior. Our results suggest that we should move beyond imitation learning when pretraining LMs and incorporate human preferences from the start of training.
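To make the idea of conditional training concrete, the following is a minimal sketch of one way the training data could be prepared. The abstract does not specify the mechanism, so everything here is an assumption for illustration: the control tokens `<|good|>` and `<|bad|>`, the threshold of `0.0`, the helper names `annotate` and `generation_prefix`, and the `toy_reward` function, which merely stands in for a learned reward model. The sketch tags each text segment with a control token chosen from its reward score, so that an LM trained on the tagged text with the usual next-token objective learns a distribution over tokens conditional on that tag.

```python
from typing import Callable, List

# Hypothetical control tokens; the abstract does not name them.
GOOD = "<|good|>"  # marks segments the reward model scores at or above the threshold
BAD = "<|bad|>"    # marks segments the reward model scores below the threshold


def toy_reward(segment: str) -> float:
    """Stand-in for a learned reward model: penalize a blocklisted word."""
    blocklist = {"stupid"}
    words = {w.strip(".,!?").lower() for w in segment.split()}
    return -1.0 if words & blocklist else 1.0


def annotate(
    segments: List[str],
    reward_fn: Callable[[str], float],
    threshold: float = 0.0,  # assumed cutoff, chosen only for this example
) -> str:
    """Prepend a control token to every segment based on its reward score."""
    pieces = []
    for seg in segments:
        token = GOOD if reward_fn(seg) >= threshold else BAD
        pieces.append(token + seg)
    return "".join(pieces)


def generation_prefix(prompt: str = "") -> str:
    """At sampling time, condition on the 'good' token to steer generation."""
    return GOOD + prompt


corpus = ["The sky is blue.", " That idea is stupid."]
training_text = annotate(corpus, toy_reward)
print(training_text)
```

Under these assumptions, the tagged text would be fed to ordinary LM pretraining unchanged, and generation would start from `generation_prefix(prompt)`. The appeal of this design is that the model still sees the undesirable text, and so can still learn from it, while being told which parts not to imitate.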