Aligning large language models (LLMs) with human intentions has become a critical task for safely deploying models in real-world systems. While existing alignment approaches have seen empirical success, a theoretical understanding of how these methods affect model behavior remains an open question. Our work provides an initial attempt to theoretically analyze the learning dynamics of human preference alignment. We formally show how the distribution of preference datasets influences the rate of model updates and provide rigorous guarantees on training accuracy. Our theory also reveals an intricate phenomenon: the optimization is prone to prioritizing behaviors with higher preference distinguishability. We empirically validate our findings on contemporary LLMs and alignment tasks, reinforcing our theoretical insights and shedding light on considerations for future alignment approaches. Disclaimer: This paper contains potentially offensive text; reader discretion is advised.