Large language models (LLMs) have demonstrated remarkable capabilities but often struggle to align with human preferences, leading to harmful or undesirable outputs. Preference learning, which trains models to distinguish between preferred and non-preferred responses based on human feedback, has become a crucial component for ensuring that LLMs align with human values. Despite its widespread adoption in real-world systems, a thorough theoretical understanding of the generalization guarantees for these models remains lacking. This paper bridges that gap by introducing a new theoretical framework for analyzing the generalization guarantees of models trained with direct preference optimization (DPO). Whereas existing generalization theory often focuses on overparameterized models achieving near-optimal loss or on models independent of the training process, our framework rigorously assesses how well models generalize after a finite number of gradient steps, reflecting real-world LLM training practices. By analyzing the reward margin associated with each sample and its trajectory throughout training, we can effectively bound the generalization error. We derive learning guarantees showing that, under specific conditions, models trained with DPO can correctly discern preferred responses on unseen data with high probability. These insights are empirically validated on contemporary LLMs, underscoring the practical relevance of our theoretical findings.
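For concreteness, here is a minimal sketch of the quantity such an analysis would track, assuming the standard DPO formulation; the paper's precise per-sample reward margin may be defined differently. Given a prompt $x$ with preferred response $y_w$ and dispreferred response $y_l$, the DPO objective and the implicit reward margin are
\[
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim \mathcal{D}}\!\left[\log \sigma\big(m_\theta(x, y_w, y_l)\big)\right],
\qquad
m_\theta(x, y_w, y_l) = \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)},
\]
where $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ is the frozen reference policy, $\beta$ is a temperature hyperparameter, and $\sigma$ is the logistic function. A positive margin $m_\theta$ on an unseen preference pair means the model ranks the preferred response above the dispreferred one, which is the event the stated learning guarantees concern.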