Pre-trained Language Models (LMs) exhibit strong zero-shot and in-context learning capabilities; however, their behaviors are often difficult to control. By utilizing Reinforcement Learning from Human Feedback (RLHF), it is possible to fine-tune unsupervised LMs to follow instructions and produce outputs that reflect human preferences. Despite its benefits, RLHF has been shown to potentially harm a language model's reasoning capabilities and introduce artifacts such as hallucination, where the model fabricates facts. To address this issue, we introduce Direct Preference Heads (DPH), a fine-tuning framework that enables LMs to learn human preference signals through an auxiliary reward head without directly affecting the output distribution of the language modeling head. We perform a theoretical analysis of our objective function and find strong ties to Conservative Direct Preference Optimization (cDPO). Finally, we evaluate our models on GLUE, RACE, and the GPT4All evaluation suite and demonstrate that our method produces models which achieve higher scores than those fine-tuned with Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO) alone.
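The core architectural idea, a preference-scoring head that shares the transformer trunk with the language modeling head but has its own parameters, can be illustrated with a minimal sketch. This is not the paper's implementation; the pooling choice (last hidden state), dimensions, and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, VOCAB = 16, 32

# Shared transformer trunk output: one hidden state per sequence position.
hidden_states = rng.standard_normal((4, HIDDEN))  # (seq_len, hidden)

# Language modeling head: maps each hidden state to vocabulary logits.
W_lm = rng.standard_normal((HIDDEN, VOCAB))

# Auxiliary reward head (illustrative): projects a pooled hidden state
# to a single scalar preference score. Here we pool by taking the
# final position's hidden state.
w_reward = rng.standard_normal(HIDDEN)

lm_logits = hidden_states @ W_lm              # generation path: (seq_len, vocab)
reward = float(hidden_states[-1] @ w_reward)  # preference path: scalar

print(lm_logits.shape)  # (4, 32)
```

Because the two heads have disjoint parameters, a preference loss applied to `reward` updates only the reward head (and, depending on training choices, the shared trunk), rather than directly reshaping the LM head's output distribution.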