We introduce Adversarial Policy Optimization (AdvPO), a novel solution to the pervasive issue of reward over-optimization in Reinforcement Learning from Human Feedback (RLHF) for Large Language Models (LLMs). Over-optimization occurs when the reward model is only an imperfect proxy for human preference and RL-driven policy optimization exploits its inaccuracies. We first introduce a lightweight way to quantify reward uncertainty that relies solely on the last-layer embeddings of the reward model and avoids computationally expensive reward ensembles. AdvPO then improves the policy by solving a distributionally robust optimization problem centred on the confidence interval of the reward model's predictions. Through comprehensive experiments on the Anthropic HH and TL;DR summarization datasets, we show that AdvPO mitigates over-optimization, which in turn improves performance under human-assisted evaluation.
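To make the idea of an ensemble-free uncertainty estimate concrete, the sketch below shows one way a confidence width could be computed from last-layer embeddings alone. It assumes the reward is approximately linear in those embeddings and uses the elliptical bound familiar from linear bandits, `scale * sqrt(e^T M^{-1} e)` with `M = ridge * I + sum_i e_i e_i^T`. The abstract does not state the formula, so this form, the function names `fit_precision`, `reward_uncertainty`, and `lower_confidence_reward`, and the parameters `ridge` and `scale` are all illustrative assumptions rather than the paper's implementation.

```python
import numpy as np


def fit_precision(embeddings: np.ndarray, ridge: float = 1.0) -> np.ndarray:
    """Return M^{-1}, where M = ridge * I + sum_i e_i e_i^T.

    `embeddings` has shape (n_samples, dim): one last-layer embedding per
    (prompt, response) pair seen when the reward model was trained.
    `ridge` is a hypothetical regularization strength.
    """
    dim = embeddings.shape[1]
    m = ridge * np.eye(dim) + embeddings.T @ embeddings
    return np.linalg.inv(m)


def reward_uncertainty(e: np.ndarray, m_inv: np.ndarray, scale: float = 1.0) -> np.ndarray:
    """Confidence width scale * sqrt(e^T M^{-1} e) for each row of `e`.

    `scale` is a hypothetical constant controlling the interval width.
    """
    e = np.atleast_2d(e)
    # Quadratic form e_n^T M^{-1} e_n, computed for every row n at once.
    quad = np.einsum("nd,dk,nk->n", e, m_inv, e)
    return scale * np.sqrt(np.maximum(quad, 0.0))


def lower_confidence_reward(reward: np.ndarray, width: np.ndarray) -> np.ndarray:
    """Lower end of the interval [reward - width, reward + width].

    This per-sample penalty is a simple stand-in used for illustration;
    it is not the distributionally robust objective itself.
    """
    return np.asarray(reward) - np.asarray(width)
```

Under these assumptions, a response whose embedding points in a direction well covered by the reward model's training data receives a narrow interval, while an embedding in an unseen direction receives a wide one. The cost is a single matrix inverse plus one quadratic form per sample, with no additional reward models to train or evaluate.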