Standard methods for aligning large language models with human preferences learn from pairwise comparisons among sampled candidate responses and regularize toward a reference policy. Despite their effectiveness, how the choices of sampling and reference policy affect the learned policy is poorly understood theoretically. We investigate these effects through Identity Preference Optimization, a widely used preference-alignment framework, and show that suitable instance-dependent sampling can yield stronger ranking guarantees, while skewed on-policy sampling can induce excessive concentration under structured preferences. We then analyze iterative alignment dynamics in which the learned policy feeds back into future sampling and reference policies, reflecting the common practice of training on model-generated preference data. We prove that these dynamics can exhibit persistent oscillations or entropy collapse for certain parameter choices, and we characterize regimes that guarantee stability. Our theoretical insights extend to Direct Preference Optimization, indicating that the phenomena we identify are common to a broader class of preference-alignment methods. Experiments on real-world preference data validate our findings.