Human preference alignment is essential for improving the interaction quality of large language models (LLMs). Existing alignment methods depend on manually annotated preference data to guide LLM optimization. In practice, however, continuously updating an LLM opens a distribution gap between model-generated samples and human-preferred responses, which hinders fine-tuning efficiency. To mitigate this issue, previous methods require additional preference annotation on generated samples to adapt to the shifted distribution, which consumes a large amount of annotation resources. Targeting more efficient human preference optimization, we propose an adversarial preference optimization (APO) framework, in which the LLM agent and the preference model are updated alternately via a min-max game. Without additional annotation, APO adapts itself to the generation distribution gap through the adversarial learning process. Comprehensive experiments show that APO further enhances the alignment performance of baseline methods in terms of helpfulness and harmlessness. The code is available at https://github.com/Linear95/APO.
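As a rough illustration of the min-max game mentioned in the abstract, one generic way to write an adversarial objective between an LLM policy and a preference (reward) model is sketched below. The abstract does not state a formula, so every symbol here is an assumption made for illustration: $\pi_\theta$ denotes the LLM policy, $r_\phi$ the preference model, $\mathcal{D}$ the prompt distribution, and $P_{\text{gold}}$ the distribution of human-preferred responses. This should not be read as the exact APO objective.

```latex
% Hypothetical sketch of an adversarial preference game (notation assumed, not from the abstract).
% The policy \pi_\theta tries to raise the reward of its own samples;
% the preference model r_\phi tries to score generated samples below human-preferred ones.
\min_{r_\phi} \; \max_{\pi_\theta} \;
  \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)} \bigl[ r_\phi(x, y) \bigr]
  \; - \;
  \mathbb{E}_{(x, y) \sim P_{\text{gold}}} \bigl[ r_\phi(x, y) \bigr]
```

Under this reading, alternating the two updates lets the preference model keep tracking the policy's shifting generation distribution using only the existing human-preferred responses, which is consistent with the claim that no additional annotation is required.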