Language models (LMs) often exhibit undesirable text generation behaviors, including generating false, toxic, or irrelevant outputs. Reinforcement learning from human feedback (RLHF), in which human preference judgments on LM outputs are transformed into a learning signal, has recently shown promise in addressing these issues. However, such holistic feedback conveys limited information about long text outputs: it does not indicate which aspects of an output influenced the user's preference, e.g., which parts contain which type(s) of errors. In this paper, we use fine-grained human feedback (e.g., which sentence is false, which sub-sentence is irrelevant) as an explicit training signal. We introduce Fine-Grained RLHF, a framework that enables training and learning from reward functions that are fine-grained in two respects: (1) they are dense, providing a reward after every segment (e.g., a sentence) is generated; and (2) they incorporate multiple reward models associated with different feedback types (e.g., factual incorrectness, irrelevance, and information incompleteness). We conduct experiments on detoxification and long-form question answering to illustrate how learning with such reward functions leads to improved performance, supported by both automatic and human evaluation. Additionally, we show that LM behaviors can be customized using different combinations of fine-grained reward models. We release all data, collected human feedback, and code at https://FineGrainedRLHF.github.io.
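The two properties named in the abstract, a reward after every segment and a combination of several reward models, can be illustrated with a minimal sketch. It is not the released implementation: the naive sentence splitter, the keyword-based `toy_factuality` and `toy_relevance` scorers, the function `fine_grained_rewards`, and the weights are all hypothetical stand-ins, and the sketch assumes every reward model scores the same sentence-level segments.

```python
from typing import Callable, Dict, List, Tuple

# A reward function maps one text segment to a scalar score (assumption).
RewardFn = Callable[[str], float]


def split_sentences(text: str) -> List[str]:
    """Naive splitter: end a segment after '.', '!' or '?'."""
    segments: List[str] = []
    current = ""
    for ch in text:
        current += ch
        if ch in ".!?":
            if current.strip():
                segments.append(current.strip())
            current = ""
    if current.strip():  # keep any trailing text without end punctuation
        segments.append(current.strip())
    return segments


def fine_grained_rewards(
    text: str,
    reward_fns: Dict[str, RewardFn],
    weights: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Return one combined reward per segment.

    Density: each sentence gets its own reward instead of one score
    for the whole output.
    Multiple reward models: the per-segment reward is a weighted sum
    over all supplied reward functions.
    """
    results: List[Tuple[str, float]] = []
    for segment in split_sentences(text):
        total = sum(weights[name] * fn(segment) for name, fn in reward_fns.items())
        results.append((segment, total))
    return results


# Hypothetical keyword-based scorers standing in for learned reward models.
def toy_factuality(segment: str) -> float:
    return -1.0 if "flat" in segment.lower() else 1.0


def toy_relevance(segment: str) -> float:
    return -1.0 if "by the way" in segment.lower() else 1.0


output = "The Earth is flat. It orbits the Sun. By the way, I like tea."
scored = fine_grained_rewards(
    output,
    reward_fns={"factuality": toy_factuality, "relevance": toy_relevance},
    weights={"factuality": 1.0, "relevance": 0.5},
)
for segment, reward in scored:
    print(f"{reward:+.1f}  {segment}")
```

Changing the entries in `weights` shifts which error type is penalized most, which is one simple way to picture how different combinations of reward models could steer LM behavior.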