Aligning large language models (LLMs) with human intentions has become a critical task for safely deploying models in real-world systems. While existing alignment approaches have seen empirical success, theoretically understanding how these methods affect model behavior remains an open question. Our work provides an initial attempt to theoretically analyze the learning dynamics of human preference alignment. We formally show how the distribution of preference datasets influences the rate of model updates and provide rigorous guarantees on the training accuracy. Our theory also reveals an intricate phenomenon where the optimization is prone to prioritizing certain behaviors with higher preference distinguishability. We empirically validate our findings on contemporary LLMs and alignment tasks, reinforcing our theoretical insights and shedding light on considerations for future alignment approaches. Disclaimer: This paper contains potentially offensive text; reader discretion is advised.