Traditional language model alignment methods, such as Direct Preference Optimization (DPO), are limited by their dependence on static, pre-collected paired preference data, which hampers their adaptability and practical applicability. To overcome this limitation, we introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm that does not require existing paired data. Building on the self-play concept, in which the model autonomously generates its own negative responses, we further incorporate an off-policy learning pipeline to improve both data exploration and exploitation. Specifically, we employ an Exponential Moving Average (EMA) model in conjunction with a replay buffer to enable dynamic updates of response segments, effectively integrating real-time feedback with insights from historical data. Our comprehensive evaluations of the LLaMA3-8B and Mistral-7B models across benchmarks, including the Open LLM Leaderboard, IFEval, AlpacaEval 2.0, and MT-Bench, demonstrate that SAPO matches or surpasses established offline contrastive baselines, such as DPO and Odds Ratio Preference Optimization (ORPO), and outperforms offline self-play methods like SPIN. Our code is available at https://github.com/yinyueqin/SAPO
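The two off-policy components the abstract names can be sketched as follows. This is a minimal illustration of an EMA weight update and a bounded replay buffer, not the paper's actual implementation; all names (`ema_update`, `ReplayBuffer`, the `decay` and `capacity` values) are hypothetical.

```python
import random
from collections import deque

def ema_update(ema_params, model_params, decay=0.99):
    """Blend the current policy weights into the slow-moving EMA copy.

    The EMA model lags the trained model, so responses it generates
    provide a more stable off-policy signal. (Illustrative sketch:
    parameters are plain floats rather than tensors.)
    """
    return [decay * e + (1.0 - decay) * m
            for e, m in zip(ema_params, model_params)]

class ReplayBuffer:
    """Fixed-capacity store of (prompt, response) pairs from past EMA models.

    Sampling mixes historical segments with freshly generated ones,
    which is the "exploration and exploitation" trade-off the abstract
    refers to. Hypothetical interface, not SAPO's actual buffer.
    """
    def __init__(self, capacity=1000):
        self.buffer = deque(maxlen=capacity)  # oldest entries evicted first

    def add(self, pair):
        self.buffer.append(pair)

    def sample(self, k):
        # Uniform sampling without replacement, capped at buffer size.
        return random.sample(list(self.buffer), min(k, len(self.buffer)))
```

In a training loop, one would periodically call `ema_update` on the EMA copy, generate negative responses with it, push them into the buffer, and draw mixed batches via `sample` for the preference-optimization step.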