Aligning large language models (LLMs) with human preferences is essential to improving their interaction quality. Existing alignment methods rely on manually annotated preference data to guide LLM optimization. However, as the LLM is continuously updated during alignment, a distribution gap opens between model-generated samples and the human-annotated responses, which hinders training effectiveness. To mitigate this issue, previous methods require additional preference annotation on newly generated samples to adapt to the shifted distribution, which consumes substantial annotation resources. Targeting more efficient human preference optimization, we propose an Adversarial Preference Optimization (APO) framework, in which the LLM and the reward model are updated alternately via a min-max game. Through adversarial training, the reward model can adapt to the shifted generation distribution of the LLM without any additional annotation. In comprehensive experiments, we find that the proposed adversarial training framework further improves existing alignment baselines in terms of LLM helpfulness and harmlessness. The code is at https://github.com/Linear95/APO.
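The min-max game between the LLM and the reward model can be sketched as follows. This is an illustrative reading under assumed notation, not necessarily the exact objective used in the framework: $\pi_\theta$ denotes the LLM policy, $r_\phi$ the reward model, $\mathcal{D}$ the prompt distribution, and $P_{\text{gold}}$ the distribution of human-annotated responses. None of these symbols appear in the abstract.

```latex
% Assumed notation: \pi_\theta (LLM), r_\phi (reward model),
% \mathcal{D} (prompts), P_{\text{gold}} (annotated responses).
\min_{\phi} \max_{\theta} \;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}
    \bigl[ r_\phi(x, y) \bigr]
  \;-\;
  \mathbb{E}_{(x, y) \sim P_{\text{gold}}}
    \bigl[ r_\phi(x, y) \bigr]
```

Under this reading, the LLM step raises the reward of its own samples, while the reward-model step widens the score gap in favor of the annotated responses over the current generations. Alternating the two steps is what would let the reward model track the shifting generation distribution without new labels. A practical objective of this kind would likely also need regularization terms, which are omitted here.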