Reinforcement learning from human feedback (RLHF) has emerged as an effective approach to aligning large language models (LLMs) with human preferences. RLHF consists of three steps, namely human preference collection, reward learning, and policy optimization, which are usually performed serially. Despite its popularity, however, a (fixed) reward model may become inaccurate off-distribution, since policy optimization continuously shifts the LLM's data distribution. Repeatedly collecting new preference data from the latest LLMs may alleviate this issue, but it unfortunately makes the resulting system more complicated and difficult to optimize. In this paper, we propose reward learning on policy (RLP), an unsupervised framework that refines a reward model using policy samples to keep it on-distribution. Specifically, an unsupervised multi-view learning method is introduced to learn robust representations of policy samples. Meanwhile, a synthetic preference generation approach is developed to simulate high-quality preference data with policy outputs. Extensive experiments on three benchmark datasets show that RLP consistently outperforms state-of-the-art methods. Our code is available at \url{https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/rlp}.