To ensure that large language model (LLM) responses are helpful and non-toxic, we usually fine-tune a reward model on human preference data. We then select policy responses with high rewards (best-of-n sampling) or further optimize the policy to produce responses with high rewards (reinforcement learning from human feedback). However, this process is vulnerable to reward overoptimization or hacking, in which the responses selected have high rewards due to errors in the reward model rather than a genuine preference. This is especially problematic as the prompt or response diverges from the training data. It should be possible to mitigate these issues by training a Bayesian reward model, which signals higher uncertainty further from the training data distribution. Therefore, we trained Bayesian reward models using Laplace-LoRA (Yang et al., 2024) and found that the resulting uncertainty estimates can successfully mitigate reward overoptimization in best-of-n sampling.
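As a rough illustration of how uncertainty estimates could enter best-of-n sampling, the sketch below ranks candidate responses by a reward mean minus a multiple of its standard deviation. The penalty form `mean - k * std`, the function names `best_of_n` and `toy_reward`, the coefficient `k`, and the toy scores are all assumptions made for illustration; they are not taken from the text above, and no Laplace-LoRA code is involved. The reward model is stubbed out as a lookup table returning a hypothetical (mean, standard deviation) pair.

```python
from typing import Callable, Sequence, Tuple


def best_of_n(
    responses: Sequence[str],
    reward_fn: Callable[[str], Tuple[float, float]],
    k: float = 1.0,
) -> str:
    """Pick the response with the highest uncertainty-penalized reward.

    reward_fn returns (mean, std) for a response. With k = 0 this
    reduces to ordinary best-of-n selection on the mean reward.
    """
    def penalized(response: str) -> float:
        mean, std = reward_fn(response)
        return mean - k * std  # assumed penalty form, for illustration only

    return max(responses, key=penalized)


# Hypothetical reward-model outputs: (mean reward, standard deviation).
# The second candidate has the highest mean but is also the most uncertain,
# mimicking a response far from the reward model's training data.
toy_scores = {
    "in-distribution answer": (1.0, 0.1),
    "reward-hacking answer": (1.5, 2.0),
    "weak answer": (0.2, 0.1),
}


def toy_reward(response: str) -> Tuple[float, float]:
    return toy_scores[response]


candidates = list(toy_scores)
plain_choice = best_of_n(candidates, toy_reward, k=0.0)
penalized_choice = best_of_n(candidates, toy_reward, k=1.0)
print(plain_choice)
print(penalized_choice)
```

With `k = 0` the selection follows the mean reward alone, so the high-mean, high-uncertainty candidate wins; with `k = 1` its penalized score drops below that of the low-uncertainty candidate. In practice the (mean, std) pair would come from a Bayesian reward model rather than a lookup table, and `k` would need tuning.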