Reinforcement learning (RL) has become a predominant technique for aligning language models (LMs) with human preferences or promoting outputs deemed desirable by a given reward function. Standard RL approaches optimize average reward, while methods explicitly focused on reducing the probability of undesired outputs typically come at a cost to average-case performance. To improve this tradeoff, we introduce RePULSe, a new training method that augments the standard RL loss with an additional loss that uses learned proposals to guide sampling toward low-reward outputs, and then reduces those outputs' probability. Our experiments demonstrate that, compared to standard RL alignment approaches and alternatives, RePULSe achieves a better tradeoff between expected reward and the probability of undesired outputs and is more adversarially robust.
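To make the high-level description above concrete, the following is a minimal PyTorch sketch of a combined objective in this spirit: a standard policy-gradient term plus an importance-weighted penalty that lowers the policy's probability on low-reward samples drawn from a learned proposal. All function names, arguments, and the specific weighting scheme are illustrative assumptions, not the paper's actual objective.

```python
import torch

def combined_loss(policy_logprobs, rewards,
                  proposal_logprobs, policy_logprobs_on_proposal_samples,
                  proposal_rewards, alpha=0.1):
    """Hypothetical sketch: standard RL loss plus a penalty term that reduces
    the policy's probability of low-reward outputs found via a learned proposal."""
    # Standard REINFORCE-style term on samples from the policy itself
    # (maximize expected reward; rewards are detached, gradients flow
    # through the policy log-probabilities only).
    rl_loss = -(rewards.detach() * policy_logprobs).mean()

    # Importance weights correcting for the fact that the extra samples come
    # from the proposal q rather than the policy p (detached so they act as
    # fixed weights on the penalty term).
    log_iw = (policy_logprobs_on_proposal_samples - proposal_logprobs).detach()
    iw = torch.softmax(log_iw, dim=0)

    # Penalty: push down the policy's log-probability of the lower-reward
    # half of the proposal samples (an assumed, simplistic selection rule).
    low_reward_mask = (proposal_rewards < proposal_rewards.median()).float()
    penalty = (iw * low_reward_mask * policy_logprobs_on_proposal_samples).sum()

    return rl_loss + alpha * penalty
```

In this sketch, the proposal is assumed to have been trained separately to concentrate on low-reward regions; how that proposal is learned, and the exact form of the weighting and penalty, are where the method's details lie.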