Large language models (LLMs) often generate misleading content, emphasizing the need to align them with human values to ensure secure AI systems. Reinforcement learning from human feedback (RLHF) has been employed to achieve this alignment by combining a reward model, typically based on the Bradley-Terry paired comparison, with an RL algorithm such as Proximal Policy Optimization (PPO) to optimize LLM responses. However, RLHF is complex, unstable, and sensitive to hyperparameters. In this paper, we propose Preference Ranking Optimization (PRO) as an alternative to PPO for directly aligning LLMs with the Bradley-Terry comparison. PRO extends the pairwise Bradley-Terry comparison to accommodate preference rankings of any length. By iteratively contrasting the likelihoods of generating the candidate responses, PRO instructs the LLM to prioritize the best response while progressively ranking the remaining responses. In this manner, PRO effectively transforms human alignment into aligning the probability ranking of $n$ responses generated by the LLM with the human preference ranking of these responses. Experiments show that PRO outperforms existing alignment algorithms, achieving results comparable to ChatGPT and human responses under automatic, reward-based, GPT-4, and human evaluations. Furthermore, we demonstrate that longer, more diverse, and higher-quality preference ranking sequences consistently enhance the performance of human alignment.
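The iterative contrast described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact objective, so the code below assumes a Plackett-Luce style listwise extension of the Bradley-Terry comparison: at step $k$, the $k$-th ranked response is contrasted against all responses ranked below it, then dropped, and the process repeats on the remainder. The function names `ranking_loss` and `pairwise_loss` and the use of plain floats as per-response scores (for example, length-normalized log-likelihoods under the LLM) are hypothetical choices for illustration, not the paper's implementation.

```python
import math
from typing import Sequence


def ranking_loss(scores: Sequence[float]) -> float:
    """Listwise loss for scores ordered from most to least preferred.

    scores[k] is a hypothetical model score for the k-th ranked response,
    e.g. a length-normalized log-likelihood (an assumption of this sketch).
    """
    if len(scores) < 2:
        raise ValueError("need at least two ranked responses")
    loss = 0.0
    # Step k contrasts the k-th response against itself and every response
    # ranked below it, then drops it and repeats on the remaining suffix.
    for k in range(len(scores) - 1):
        suffix = scores[k:]
        m = max(suffix)  # subtract the max for a stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(s - m) for s in suffix))
        loss -= scores[k] - log_denominator
    return loss


def pairwise_loss(preferred: float, rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for a single preference pair."""
    return math.log1p(math.exp(rejected - preferred))
```

Under these assumptions, a ranking of length two reduces to the ordinary pairwise Bradley-Terry loss, and scores that agree with the human ranking, such as `ranking_loss([2.0, 1.0, 0.0])`, yield a lower loss than scores in the reverse order. In training, the scores would come from the policy LLM itself, so that minimizing the loss pushes its probability ranking toward the human preference ranking.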