Human preference alignment is a crucial training step for improving the interaction quality of large language models (LLMs). Existing alignment methods depend on manually annotated preference data to guide the direction of LLM optimization. In practice, however, continuously updating an LLM creates a distribution gap between model-generated samples and human-preferred responses, which hinders fine-tuning efficiency. To mitigate this issue, previous methods require additional preference annotation on generated samples to adapt to the shifted distribution, which consumes a large amount of annotation resources. Targeting more efficient human preference optimization, we propose an adversarial preference optimization (APO) framework, in which the LLM agent and the preference model are updated alternately via a min-max game. Without additional annotation, APO adapts itself to the generation distribution gap through the adversarial learning process. In experiments, we empirically verify the effectiveness of APO in improving the LLM's helpfulness and harmlessness compared with rejection sampling baselines.
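The abstract does not state the APO objective, so the following is only a toy illustration of what "updating alternately via a min-max game" can look like, not the authors' method. It assumes a hypothetical setting with a handful of candidate responses, a softmax policy over them (`theta`), one scalar score per response (`reward`), and an assumed GAN-style value function V = E_gold[r] - E_policy[r] - (reg/2)·||r||², which the reward model ascends and the policy descends. The function name, step sizes, and regularizer are all illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def alternating_min_max(p_gold, steps=2000, lr_reward=0.5, lr_policy=1.0, reg=1.0):
    """Toy alternating updates on an ASSUMED value function (not from the paper):

        V(theta, r) = E_gold[r] - E_pi[r] - (reg / 2) * ||r||^2

    The reward model takes a gradient-ascent step on V, then the policy
    takes a gradient-descent step on V (i.e. it raises its expected reward).
    """
    k = len(p_gold)
    theta = [0.0] * k   # policy logits over k candidate responses
    reward = [0.0] * k  # one scalar score per candidate response

    for _ in range(steps):
        pi = softmax(theta)

        # Reward step: dV/dr_i = p_gold_i - pi_i - reg * r_i
        reward = [
            r + lr_reward * (g - p - reg * r)
            for r, g, p in zip(reward, p_gold, pi)
        ]

        # Policy step: dV/dtheta_j = -pi_j * (r_j - E_pi[r])
        baseline = sum(p * r for p, r in zip(pi, reward))
        theta = [
            t + lr_policy * p * (r - baseline)
            for t, p, r in zip(theta, pi, reward)
        ]

    return softmax(theta), reward


if __name__ == "__main__":
    policy, scores = alternating_min_max([0.6, 0.3, 0.1])
    print("policy:", [round(p, 3) for p in policy])
    print("scores:", [round(s, 3) for s in scores])
```

In this toy game no new labels are introduced during the loop: the reward scores are refit against the policy's current distribution at every step, and the policy in turn drifts toward the fixed gold distribution. A real system would replace the exact expectations with sampled responses and both tables with neural networks.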