Direct Preference Optimization (DPO) and its variants have become standard for aligning Large Language Models due to their simplicity and offline stability. However, we identify two fundamental limitations. First, the optimal policy depends on arbitrary modeling choices (the scalarization function and the reference policy), so the learned behavior reflects parameterization artifacts rather than true preferences. Second, generating each response in isolation fails to exploit the comparative information in pairwise data, leaving the model's capacity for intrinsic self-reflection untapped. To address these limitations, we propose Intrinsic Self-reflective Preference Optimization (InSPO), which derives a globally optimal policy conditioned on both the context and alternative responses. We prove this formulation is superior to DPO/RLHF while guaranteeing invariance to the choice of scalarization function and reference policy. InSPO serves as a plug-and-play enhancement requiring no architectural changes and no inference overhead. Experiments demonstrate consistent improvements in win rates and length-controlled metrics, validating that unlocking self-reflection yields more robust, human-aligned LLMs.
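For concreteness, the following is a minimal sketch of the contrast the abstract draws. The first two displays are the standard DPO loss and the standard KL-regularized optimal policy (Rafailov et al., 2023); the final line uses the hypothetical notation \(\pi_{\mathrm{InSPO}}\) purely to illustrate the conditioning on an alternative response, and is not the paper's own derivation.

% Standard DPO loss over a pairwise preference dataset D of (prompt x, chosen y_w, rejected y_l):
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[\log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)\right]

% The corresponding optimal policy conditions on the context alone,
% and visibly depends on the reference policy pi_ref and the temperature beta:
\pi^{\ast}(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)\,\exp\!\big(r(x, y)/\beta\big)

% Hypothetical notation for the self-reflective conditioning described in the abstract:
% the policy additionally sees the alternative (rejected) response y_l.
\pi_{\mathrm{InSPO}}(y_w \mid x,\, y_l)

Contrasting \(\pi^{\ast}(y \mid x)\) with \(\pi_{\mathrm{InSPO}}(y_w \mid x, y_l)\) makes the two claimed limitations explicit: the former inherits the reference/scalarization choices, and it never sees the losing response it was preferred over.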