Theoretical Limits of Language Model Alignment

Language model (LM) alignment improves model outputs to reflect human preferences while preserving the capabilities of the base model. The most common alignment approaches are (i) reinforcement learning, which maximizes the expected reward under a KL-divergence constraint, and (ii) best-of-$N$ alignment, which selects the highest-reward output among $N$ independent samples. Despite their widespread use, the fundamental limits of reward improvement under a KL budget remain poorly understood. We characterize the information-theoretic limits of KL-regularized alignment by deriving the maximum achievable expected reward gain for a fixed KL-divergence budget. Our first result provides a closed-form expression for the optimal reward improvement, governed by a Jeffreys divergence term rather than the $\sqrt{\texttt{KL}}$ term used in prior analyses. We further reformulate this expression as a covariance under the base model, yielding a practical estimator that predicts achievable alignment gains from base model samples alone. We extend our analysis to the proxy reward setting, showing that the gap between ideal and proxy alignment (reward hacking) grows as the magnitude of the reward error increases and as the KL penalty factor decreases. We then prove that reward ensembling mitigates reward hacking, providing a theoretical justification for this technique used in practice. Empirically, we compute the KL-reward Pareto frontier for two LM tasks, safety and summarization, and show that best-of-$N$ closely approaches the theoretical limit, while PPO and GRPO remain substantially suboptimal. Our theoretical results shed light on several empirically observed phenomena in the alignment literature and suggest that algorithmic improvements are needed to achieve optimal alignment without high inference costs.
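
The abstract does not state the closed-form expression itself, so the following is only a minimal sketch of the standard identities that such a result plausibly builds on, not the paper's theorem. It assumes the usual KL-regularized objective $\mathbb{E}_\pi[r] - \beta\,\mathrm{KL}(\pi \,\|\, \pi_0)$, whose maximizer is the exponentially tilted policy $\pi^*(y) \propto \pi_0(y)\exp(r(y)/\beta)$. For that policy, the reward gain equals $\beta$ times the Jeffreys divergence between $\pi^*$ and $\pi_0$, and also equals $\mathrm{Cov}_{\pi_0}(r, e^{r/\beta}) / \mathbb{E}_{\pi_0}[e^{r/\beta}]$. The toy distribution, the reward values, and all function names (`tilted_policy`, `estimate_gain`, and so on) are hypothetical and chosen for illustration.

```python
import math

def tilted_policy(p0, r, beta):
    """Maximizer of E_pi[r] - beta * KL(pi || p0): pi(y) ∝ p0(y) * exp(r(y) / beta)."""
    m = max(r)  # subtract the max reward for numerical stability
    w = [p * math.exp((ri - m) / beta) for p, ri in zip(p0, r)]
    z = sum(w)
    return [wi / z for wi in w]

def kl(p, q):
    """KL(p || q) for discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def expect(p, f):
    """Expectation of f under the discrete distribution p."""
    return sum(pi * fi for pi, fi in zip(p, f))

def estimate_gain(sample_rewards, beta):
    """Self-normalized estimate of the gain from rewards of base-model samples only.

    Assumption: this covariance-over-mean form is a stand-in for the paper's
    estimator, whose exact definition is not given in the abstract.
    """
    n = len(sample_rewards)
    m = max(sample_rewards)
    w = [math.exp((ri - m) / beta) for ri in sample_rewards]
    mean_r = sum(sample_rewards) / n
    mean_w = sum(w) / n
    cov = sum((ri - mean_r) * (wi - mean_w) for ri, wi in zip(sample_rewards, w)) / n
    return cov / mean_w

# Hypothetical base policy over four outputs and their rewards.
p0 = [0.4, 0.3, 0.2, 0.1]
r = [0.0, 1.0, -0.5, 2.0]
beta = 0.5

pi_star = tilted_policy(p0, r, beta)
gain = expect(pi_star, r) - expect(p0, r)

# Jeffreys divergence: the symmetrized KL between pi_star and p0.
jeffreys = kl(pi_star, p0) + kl(p0, pi_star)

# Covariance form, computed under the base policy only.
w = [math.exp(ri / beta) for ri in r]
cov = expect(p0, [ri * wi for ri, wi in zip(r, w)]) - expect(p0, r) * expect(p0, w)
cov_form = cov / expect(p0, w)

print(gain, beta * jeffreys, cov_form)
```

On this toy example the three printed quantities agree up to floating-point error, which is what the two identities predict. `estimate_gain` applies the same covariance form to a list of rewards scored on samples drawn from the base model, so no samples from the aligned policy are needed.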

