Reinforcement Learning from Human Feedback (RLHF) leverages human preference data to train language models to align more closely with human preferences. These preference data, however, are labeled at the sequence level, creating a mismatch between sequence-level preference labels and the tokens that the language model generates autoregressively. Although several recent approaches have tried to provide token-level (i.e., dense) rewards for each individual token, they typically rely on predefined discrete reward values (e.g., positive: +1, negative: -1, neutral: 0), failing to account for the varying degrees of preference inherent to each token. To address this limitation, we introduce TLCR (Token-Level Continuous Reward) for RLHF, which trains a discriminator to distinguish positive from negative tokens and uses the discriminator's confidence to assign a continuous, context-dependent reward to each token. Extensive experiments show that TLCR yields consistent performance improvements over previous sequence-level and token-level discrete rewards on open-ended generation benchmarks.
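The core reward-shaping idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a discriminator that, for each token in context, outputs a confidence `p_t` in [0, 1] that the token is preferred, and maps that confidence to a signed continuous reward via the illustrative transform `r_t = 2 * p_t - 1`, so that a neutral token (confidence 0.5) receives zero reward rather than a fixed discrete label.

```python
from typing import List

def token_level_continuous_rewards(token_confidences: List[float]) -> List[float]:
    """Map per-token 'preferred' confidences p_t in [0, 1] to continuous
    rewards r_t in [-1, 1] via r_t = 2 * p_t - 1 (an illustrative mapping,
    not the paper's exact formulation).

    Unlike discrete schemes (+1 / -1 / 0), this preserves the degree of
    preference: a confidently positive token gets a larger reward than a
    marginally positive one.
    """
    return [2.0 * p - 1.0 for p in token_confidences]

# Toy confidences from a hypothetical token-level discriminator,
# evaluated in context for three generated tokens.
confidences = [0.95, 0.5, 0.2]
rewards = token_level_continuous_rewards(confidences)
```

In an RLHF loop, these dense per-token rewards would replace (or augment) the single sequence-level scalar when computing the policy-gradient update, so credit assignment no longer has to spread one label across every token.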