Best-of-N (BoN) sampling with a reward model has been shown to be an effective strategy for aligning Large Language Models (LLMs) to human preferences at decoding time. However, BoN sampling is susceptible to a problem known as reward hacking: because the reward model is an imperfect proxy for the true objective, over-optimizing the proxy reward can degrade performance on the true objective. A common way to prevent reward hacking in preference learning techniques is to optimize the reward with a proximity regularization term (e.g., KL regularization), which keeps the language model close to the reference model. In this work, we propose Regularized Best-of-N (RBoN), a variant of BoN that aims to mitigate reward hacking by incorporating a proximity term into response selection, analogous to the regularization used in preference learning techniques. We evaluate two variants of RBoN on the AlpacaFarm dataset and find that they outperform BoN, especially when the proxy reward model has a low correlation with the true objective.
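To make the contrast between BoN and a proximity-regularized selection rule concrete, the following is a minimal sketch and not the exact formulation evaluated here. It assumes the proximity term is the reference model's log-probability of each candidate, weighted by a coefficient `beta`. The function names, `beta`, and the toy scores are hypothetical and chosen only for illustration.

```python
from typing import Callable, Sequence


def best_of_n(responses: Sequence[str], reward: Callable[[str], float]) -> str:
    """Plain BoN: return the candidate with the highest proxy reward."""
    return max(responses, key=reward)


def regularized_best_of_n(
    responses: Sequence[str],
    reward: Callable[[str], float],
    ref_logprob: Callable[[str], float],
    beta: float,
) -> str:
    """Sketch of a regularized selection rule.

    Assumption: proximity to the reference model is measured by the
    reference log-probability of the response, so candidates that the
    reference model finds unlikely are penalized in proportion to beta.
    """
    return max(responses, key=lambda y: reward(y) + beta * ref_logprob(y))


# Toy scores (hypothetical): "a" has the highest proxy reward but is very
# unlikely under the reference model, as a reward-hacked response might be.
toy_reward = {"a": 1.0, "b": 0.9, "c": 0.2}
toy_ref_logprob = {"a": -10.0, "b": -1.0, "c": -0.5}
candidates = list(toy_reward)

bon_choice = best_of_n(candidates, toy_reward.get)
rbon_choice = regularized_best_of_n(
    candidates, toy_reward.get, toy_ref_logprob.get, beta=0.1
)
```

In this toy setting, plain BoN selects `"a"`, while the regularized rule with `beta=0.1` selects `"b"`, whose reward is nearly as high but which stays much closer to the reference model. Setting `beta=0` recovers plain BoN.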