Aligning large language models (LLMs) with human intentions has become a critical task for safely deploying models in real-world systems. While existing alignment approaches have seen empirical success, a theoretical understanding of how these methods affect model behavior remains an open question. Our work takes an initial step toward theoretically analyzing the learning dynamics of human preference alignment. We formally show how the distribution of the preference dataset influences the rate of model updates, and we provide rigorous guarantees on the training accuracy. Our theory also reveals an intricate phenomenon in which the optimization is prone to prioritizing certain behaviors with higher preference distinguishability. We empirically validate our findings on contemporary LLMs and alignment tasks, reinforcing our theoretical insights and shedding light on considerations for future alignment approaches. Disclaimer: This paper contains potentially offensive text; reader discretion is advised.