Reinforcement Learning (RL) from human preference-based feedback, commonly called RLHF, is a popular paradigm for fine-tuning generative models and has produced impressive models such as GPT-4 and Claude 3 Opus. This framework often consists of two steps: learning a reward model from an offline preference dataset, followed by running online RL to optimize the learned reward model. In this work, leveraging the idea of reset, we propose a new RLHF algorithm with provable guarantees. Motivated by the fact that the offline preference dataset provides informative states (i.e., data that is preferred by the labelers), our new algorithm, Dataset Reset Policy Optimization (DR-PO), integrates the existing offline preference dataset into the online policy training procedure via dataset reset: it directly resets the policy optimizer to the states in the offline dataset, instead of always starting from the initial state distribution. In theory, we show that DR-PO learns to perform at least as well as any policy that is covered by the offline dataset under general function approximation with finite sample complexity. In experiments, we demonstrate that on both the TL;DR summarization and the Anthropic Helpful Harmful (HH) datasets, the generation from DR-PO is better than that from Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), under the metric of GPT-4 win rate. Code for this work can be found at https://github.com/Cornell-RL/drpo.
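The dataset-reset idea can be illustrated with a minimal sketch. It is not taken from the DR-PO codebase, and it rests on an assumption the abstract does not spell out: a "state" is modeled as the prompt followed by a prefix of the labeler-preferred response, both given as lists of token ids. The function name `sample_start_state` and the parameter `reset_prob` are hypothetical, introduced only for illustration.

```python
import random
from typing import List, Sequence, Tuple


def sample_start_state(
    prompt: Sequence[int],
    preferred: Sequence[int],
    rng: random.Random,
    reset_prob: float = 1.0,  # hypothetical knob: how often to reset
) -> Tuple[List[int], int]:
    """Pick the token sequence from which the policy starts generating.

    With probability `reset_prob`, reset to the prompt plus a uniformly
    random prefix of the labeler-preferred response (assumed notion of a
    dataset state). Otherwise start from the prompt alone, which is the
    usual initial state.

    Returns the start state and the number of dataset tokens reused.
    """
    if rng.random() < reset_prob:
        # randint is inclusive on both ends, so cut may be 0 (no reset)
        # or len(preferred) (the whole preferred response).
        cut = rng.randint(0, len(preferred))
    else:
        cut = 0
    return list(prompt) + list(preferred[:cut]), cut


if __name__ == "__main__":
    rng = random.Random(0)
    prompt = [1, 2, 3]            # toy token ids for the prompt
    preferred = [10, 11, 12, 13]  # toy token ids for the preferred response
    state, cut = sample_start_state(prompt, preferred, rng)
    print(f"reused {cut} dataset tokens; policy continues from {state}")
```

In such a setup, the policy would generate the remainder of the response from the returned state, and the learned reward model would score the completed sequence. Setting `reset_prob` to 0 recovers the standard procedure of always starting from the initial state distribution.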