Large language models (LLMs) often generate misleading content, which underscores the need to align them with human values to ensure secure AI systems. Reinforcement learning from human feedback (RLHF) has been employed to achieve this alignment. However, it has two main drawbacks: (1) compared with supervised fine-tuning (SFT), RLHF is complex, unstable, and sensitive to hyperparameters; (2) despite extensive trial-and-error sampling, the multiple sampled responses are reduced to pairwise contrasts, which loses contrast from a macro perspective. In this paper, we propose Preference Ranking Optimization (PRO), an efficient SFT algorithm that directly fine-tunes LLMs for human alignment. PRO extends the pairwise contrast to accommodate preference rankings of any length. By iteratively contrasting candidates, PRO instructs the LLM to prioritize the best response while progressively ranking the remaining responses. In this manner, PRO transforms human alignment into aligning the probability ranking of the n responses generated by the LLM with the human preference ranking over those responses. Experiments show that PRO outperforms baseline algorithms and achieves results comparable to ChatGPT and human responses under automatic, reward-based, GPT-4, and human evaluations.
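The iterative contrast described above can be illustrated with a small sketch. This is one plausible reading of the abstract, not the paper's actual objective: it assumes each response has a scalar model score (for example a length-normalized log-probability), that the scores are listed from most to least preferred, and that each step applies a softmax contrast between the current best response and everything ranked below it. The function names `log_sum_exp` and `iterative_ranking_loss` are hypothetical.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def iterative_ranking_loss(scores):
    """List-wise ranking loss over scores ordered best to worst.

    Assumption (not taken from the source): at step k, the k-th response
    is treated as the positive and all lower-ranked responses as
    negatives; the per-step negative log-softmax terms are summed.
    """
    if len(scores) < 2:
        raise ValueError("need at least two ranked responses")
    loss = 0.0
    for k in range(len(scores) - 1):
        # -log( exp(s_k) / sum_{i >= k} exp(s_i) )
        loss += log_sum_exp(scores[k:]) - scores[k]
    return loss


# Scores that agree with the human ranking give a lower loss than
# scores that reverse it.
aligned = iterative_ranking_loss([2.0, 1.0, 0.0])
reversed_order = iterative_ranking_loss([0.0, 1.0, 2.0])
print(aligned < reversed_order)
```

With exactly two responses, this sketch reduces to the familiar pairwise contrast, which is the sense in which a ranking of any length generalizes the pairwise case.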