Aligning large language models (LLMs) to serve users with heterogeneous and potentially conflicting preferences is a central challenge for personalized and trustworthy AI. We formalize an ideal notion of universal alignment through test-time scaling: for each prompt, the model produces $k\ge 1$ candidate responses and a user selects their preferred one. We introduce $(k,f(k))$-robust alignment, which requires the $k$-output model to achieve a win rate of at least $f(k)$ against any other single-output model, and asymptotic universal alignment (U-alignment), which requires $f(k)\to 1$ as $k\to\infty$. Our main result characterizes the optimal convergence rate: there exists a family of single-output policies whose $k$-sample product policies achieve U-alignment at rate $f(k)=\frac{k}{k+1}$, and no method can achieve a faster rate in general. We show that popular post-training methods, including Nash learning from human feedback (NLHF), can fundamentally underutilize the benefits of test-time scaling. Even though NLHF is optimal for $k=1$, sampling from the resulting (often deterministic) policy cannot guarantee a win rate exceeding $\tfrac{1}{2}$ by more than an arbitrarily small slack. This stems from a lack of output diversity: existing alignment methods can collapse to a single majority-preferred response, making additional samples redundant. In contrast, our approach preserves output diversity and achieves the optimal test-time scaling rate. In particular, we propose a family of symmetric multi-player alignment games and prove that any symmetric Nash equilibrium policy of the $(k+1)$-player alignment game achieves the optimal $(k,\frac{k}{k+1})$-robust alignment. Finally, we provide theoretical convergence guarantees for self-play learning dynamics in these games and extend the framework to opponents that also generate multiple responses.
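To make the two headline phenomena concrete, here is a minimal Monte Carlo sketch (not from the paper) on a toy instance: three responses with a cyclic, rock-paper-scissors preference pattern, so no response is majority-preferred to all others. The preference matrix `P`, the uniform policy, and the selection model (the user keeps the candidate most likely to beat the opponent's response) are all illustrative assumptions. Under these assumptions, a diversity-preserving policy's best-of-$k$ win rate climbs above the $\frac{k}{k+1}$ benchmark, while a policy collapsed to a single response gains nothing from extra samples against its best-responding opponent.

```python
import random

# Illustrative toy instance (an assumption, not the paper's construction):
# three responses with a cyclic preference pattern.  P[(x, y)] is the
# probability that a user prefers response x over response y; comparing a
# response against itself is a coin flip.
P = {
    ("a", "a"): 0.5, ("a", "b"): 1.0, ("a", "c"): 0.0,
    ("b", "a"): 0.0, ("b", "b"): 0.5, ("b", "c"): 1.0,
    ("c", "a"): 1.0, ("c", "b"): 0.0, ("c", "c"): 0.5,
}

def win_rate(policy, k, opponent, trials=200_000, seed=0):
    """Monte Carlo estimate of the k-output model's win rate against a
    single-output opponent.  Selection model (an assumption of this sketch):
    the model draws k i.i.d. candidates from `policy`, the user keeps the
    candidate most likely to beat the opponent's response y, and the model
    then wins that comparison with probability P[(best, y)]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        y = rng.choice(opponent)
        candidates = [rng.choice(policy) for _ in range(k)]
        total += max(P[(x, y)] for x in candidates)
    return total / trials

diverse = ["a", "b", "c"]   # uniform, diversity-preserving policy
collapsed = ["a"]           # policy collapsed to a single response

for k in (1, 2, 4, 8):
    print(f"k={k}:  diverse vs 'a': {win_rate(diverse, k, ['a']):.3f}  "
          f"(k/(k+1) = {k / (k + 1):.3f})   "
          f"collapsed vs 'c': {win_rate(collapsed, k, ['c']):.3f}")
```

On this instance the diverse policy's win rate against a fixed opponent is $1-\tfrac{1}{2}(\tfrac{2}{3})^k-\tfrac{1}{2}(\tfrac{1}{3})^k$, which equals $\tfrac{1}{2}$ at $k=1$ and exceeds $\frac{k}{k+1}$ thereafter, whereas the collapsed policy scores $0$ against its best-responding opponent for every $k$: additional samples from a degenerate policy are redundant, exactly the failure mode the abstract attributes to diversity collapse.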