Aligning large language models (LLMs) with human preferences plays a key role in building modern generative models and can be achieved by reinforcement learning from human feedback (RLHF). Despite their strong performance, current RLHF approaches often require a large amount of human-labelled preference data, which is expensive to collect. In this paper, inspired by the success of active learning, we address this problem by proposing query-efficient RLHF methods. We first formalize the alignment problem as a contextual dueling bandit problem and design an active-query-based proximal policy optimization (APPO) algorithm with an $\tilde{O}(d^2/\Delta)$ regret bound and an $\tilde{O}(d^2/\Delta^2)$ query complexity, where $d$ is the dimension of the feature space and $\Delta$ is the sub-optimality gap over all contexts. We then propose ADPO, a practical version of our algorithm based on direct preference optimization (DPO), and apply it to fine-tuning LLMs. Our experiments show that ADPO, while making only about half as many queries for human preference, matches the performance of the state-of-the-art DPO method.
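The abstract does not spell out how ADPO decides when to ask for a human label, so the following is only a minimal sketch of one plausible active-query rule, not the authors' implementation. It assumes the standard DPO implicit reward margin and a hypothetical uncertainty threshold: a pair is sent to the human oracle only when the margin is small in magnitude, and otherwise the model's own ranking is used as a pseudo-label. The names `dpo_margin`, `active_dpo_step`, `oracle`, `threshold`, and the tuple layout of each pair are all illustrative assumptions.

```python
import math

def dpo_margin(logp_a, logp_b, ref_logp_a, ref_logp_b, beta=0.1):
    """DPO implicit reward margin between responses a and b.

    Inputs are sequence log-probabilities under the policy (logp_*) and
    under the frozen reference model (ref_logp_*).
    """
    return beta * ((logp_a - ref_logp_a) - (logp_b - ref_logp_b))

def dpo_loss(signed_margin):
    """-log(sigmoid(signed_margin)), computed in a numerically stable way."""
    return max(-signed_margin, 0.0) + math.log1p(math.exp(-abs(signed_margin)))

def active_dpo_step(pairs, oracle, threshold=0.5, beta=0.1):
    """Average DPO loss over a batch, querying the oracle only when uncertain.

    pairs: list of (logp_a, logp_b, ref_logp_a, ref_logp_b) tuples (assumed layout).
    oracle: hypothetical callable returning True if response a is preferred.
    threshold: assumed uncertainty cutoff on |margin|; not taken from the paper.
    Returns (mean_loss, number_of_oracle_queries).
    """
    total_loss, n_queries = 0.0, 0
    for pair in pairs:
        margin = dpo_margin(*pair, beta=beta)
        if abs(margin) < threshold:
            # Model is uncertain: spend one human preference query.
            n_queries += 1
            a_preferred = oracle(pair)
        else:
            # Model is confident: reuse its own ranking as a pseudo-label.
            a_preferred = margin > 0
        signed = margin if a_preferred else -margin
        total_loss += dpo_loss(signed)
    return total_loss / max(len(pairs), 1), n_queries

if __name__ == "__main__":
    batch = [
        (-1.0, -1.2, -1.0, -1.0),   # small margin: triggers a query
        (-1.0, -20.0, -5.0, -5.0),  # large margin: no query needed
    ]
    loss, queries = active_dpo_step(batch, oracle=lambda pair: False)
    print(f"mean loss = {loss:.4f}, queries used = {queries}")
```

In a real fine-tuning loop the log-probabilities would come from the policy and reference LLMs, and the loss would be backpropagated through the policy; the threshold then controls the trade-off between query budget and label quality.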