Aligning large language models (LLMs) with human preferences is critical to recent advances in generative artificial intelligence. Reinforcement learning from human feedback (RLHF) is widely applied to achieve this objective. A key step in RLHF is learning the reward function from human feedback. However, human feedback is costly and time-consuming to obtain, making it essential to collect high-quality conversation data for human teachers to label. Additionally, different human teachers have different levels of expertise, so it is critical to query the most appropriate teacher for their opinions. In this paper, we formulate the alignment problem as an offline reinforcement learning (RL) problem. Motivated by the idea of $D$-optimal design, we first propose a dual active reward learning algorithm for the simultaneous selection of conversations and teachers. Next, we apply pessimistic RL to solve the alignment problem based on the learned reward estimator. Theoretically, we show that the reward estimator obtained through our proposed adaptive selection strategy achieves asymptotically minimal generalized variance, and we prove that the sub-optimality of our pessimistic policy scales as $O(1/\sqrt{T})$ for a given sample budget $T$. Through simulations and experiments on LLMs, we demonstrate the effectiveness of our algorithm and its superiority over state-of-the-art methods.
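To make the $D$-optimal design idea concrete, the following is a minimal sketch of a sequential selection rule, with notation assumed here for exposition only (it is not fixed by the abstract): let $x_i \in \mathbb{R}^d$ denote the feature vector of the $i$-th selected conversation-teacher pair and $I_n = \sum_{i=1}^{n} x_i x_i^{\top}$ the accumulated information matrix. A $D$-optimal selection picks the next pair to maximize the determinant of the updated information matrix,
\[
  x_{n+1} \in \arg\max_{x \in \mathcal{X}} \det\!\bigl( I_n + x x^{\top} \bigr),
\]
which, since the covariance of the reward estimator scales like $I_n^{-1}$, amounts to greedily minimizing the generalized variance $\det\bigl(I_n^{-1}\bigr)$ of that estimator.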