Preference learning (PL) with large language models (LLMs) aims to align LLM generations with human preferences. Previous work on reinforcement learning from human feedback (RLHF) has demonstrated promising results for in-distribution PL. However, because human feedback is difficult to obtain, training a reward model separately for every encountered distribution is challenging. Out-of-distribution (OOD) PL is therefore practically useful for enhancing the generalization ability of LLMs with limited preference feedback. This work addresses OOD PL by optimizing a general reward model through a meta-learning approach. During meta-training, a bilevel optimization algorithm learns a reward model that can guide policy learning to align with human preferences across various distributions. When encountering a test distribution, the meta-test procedure performs regularized policy optimization with the learned reward model for PL. We theoretically establish the convergence rate of the bilevel optimization algorithm under reasonable assumptions. Additionally, we conduct experiments on two text generation tasks across 20 held-out domains, outperforming a variety of strong baselines on various evaluation metrics.
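To make the bilevel structure concrete, below is a minimal sketch of the meta-training loop described above. It is not the paper's algorithm: it uses toy linear reward and policy models, synthetic "distributions" (tasks), a MAML-style differentiable inner loop of KL-regularized policy optimization as a stand-in for the inner level, and a cross-entropy alignment objective on preferred responses as the outer level. All names and hyperparameters are illustrative assumptions.

```python
# Sketch of bilevel meta-training for a general reward model (illustrative only).
# Inner level: a few differentiable steps of KL-regularized policy optimization
# against the current reward model. Outer level: update the reward model so the
# adapted policy aligns with preference data from the sampled distribution.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
DIM, N_ACTIONS = 8, 4                                  # toy feature / candidate sizes

phi = torch.randn(DIM, requires_grad=True)             # reward-model parameters (outer variable)
outer_opt = torch.optim.Adam([phi], lr=1e-2)

def reward(cand_feats, phi):
    """Scalar reward for each candidate-response feature vector."""
    return cand_feats @ phi                             # [batch, n_actions]

def adapted_policy(theta0, prompt_feats, cand_feats, phi, beta=0.1, inner_lr=0.5, steps=3):
    """Inner level: differentiable policy adaptation; gradients flow back to phi."""
    theta = theta0
    ref_logits = prompt_feats @ theta0                  # frozen reference policy
    for _ in range(steps):
        logits = prompt_feats @ theta
        probs = F.softmax(logits, dim=-1)
        r = reward(cand_feats, phi)
        # Expected reward minus a KL penalty toward the reference policy.
        kl = (probs * (F.log_softmax(logits, -1) - F.log_softmax(ref_logits, -1))).sum(-1)
        inner_loss = -(probs * r).sum(-1).mean() + beta * kl.mean()
        grad, = torch.autograd.grad(inner_loss, theta, create_graph=True)
        theta = theta - inner_lr * grad
    return theta

# Meta-training over sampled training distributions (synthetic tasks here).
for step in range(200):
    prompt_feats = torch.randn(16, DIM)                 # prompts from one training distribution
    cand_feats = torch.randn(16, N_ACTIONS, DIM)        # candidate-response features
    prefer = torch.randint(0, N_ACTIONS, (16,))         # index of the human-preferred response
    theta0 = torch.zeros(DIM, N_ACTIONS, requires_grad=True)

    theta_adapted = adapted_policy(theta0, prompt_feats, cand_feats, phi)
    # Outer level: the adapted policy should put high probability on preferred responses.
    outer_loss = F.cross_entropy(prompt_feats @ theta_adapted, prefer)
    outer_opt.zero_grad()
    outer_loss.backward()                               # backprop through the inner loop into phi
    outer_opt.step()
```

At meta-test time, the same inner-level routine (KL-regularized policy optimization against the learned, now-frozen reward model) would be run on prompts from the unseen distribution, which mirrors the meta-test procedure summarized in the abstract.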