Structure from Strategic Interaction & Uncertainty: Risk-Sensitive Games for Robust Preference Learning

A growing line of work reframes preference-based fine-tuning of large language models in game-theoretic terms: Nash Learning from Human Feedback (NLHF) recasts the problem as a zero-sum game over policies. However, these methods optimize expected pairwise payoffs, and so conflate policies that have similar win rates but different tail behavior. As a result, they are agnostic to where in the data distribution they succeed or fail: strong average performance can mask systematic failure across prompts, annotators, or safety-critical strata. We introduce risk-sensitive preference games, in which players optimize convex risk measures of their preference loss, exploiting structure in preference uncertainty. Although risk sensitivity generally breaks the zero-sum structure, we show that the translation invariance of many risk measures ensures that monotonicity is retained, yielding fast convergence of sample-efficient self-play methods. Furthermore, we establish algorithmic stability and offline sample complexity bounds that scale with risk. These bounds require simultaneous control of three quantities: structural bias from nonlinear risk transformations, statistical bias in risk estimation, and concentration tailored to the risk-sensitive setting. To address statistical bias, we introduce a hierarchical game formulation and a two-timescale extragradient algorithm with bias correction that converges to the Stackelberg equilibrium and is especially effective in low-sample regimes. Empirically, risk-adjusted policies are robust across data strata, stable across risk choices, and match or exceed risk-neutral performance, thereby achieving robustness without a performance tax.
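To make the contrast between expected payoffs and risk measures concrete, the following minimal sketch compares two hypothetical policies whose average preference loss is identical but whose worst stratum differs. It is an illustration under stated assumptions, not the method of the paper: the per-stratum losses are invented, and conditional value-at-risk (CVaR) and entropic risk are chosen here only as two standard examples of convex, translation-invariant risk measures; the abstract does not say which measures the authors use. The function names `cvar` and `entropic_risk` are likewise illustrative.

```python
import math


def mean(losses):
    """Risk-neutral objective: the plain average loss."""
    return sum(losses) / len(losses)


def cvar(losses, alpha):
    """Empirical CVaR: average of the worst ceil(alpha * n) losses."""
    k = max(1, math.ceil(alpha * len(losses)))
    worst = sorted(losses, reverse=True)[:k]
    return sum(worst) / k


def entropic_risk(losses, theta):
    """Entropic risk (1/theta) * log E[exp(theta * loss)] for theta > 0."""
    m = max(losses)  # subtract the max before exponentiating, for stability
    avg = sum(math.exp(theta * (x - m)) for x in losses) / len(losses)
    return m + math.log(avg) / theta


# Hypothetical per-stratum preference losses (assumed values, e.g. 1 - win rate
# on four prompt strata). Both policies have the same average loss.
uniform_policy = [0.4, 0.4, 0.4, 0.4]
uneven_policy = [0.2, 0.2, 0.2, 1.0]  # fails completely on one stratum

for name, losses in [("uniform", uniform_policy), ("uneven", uneven_policy)]:
    print(name, mean(losses), cvar(losses, 0.25), entropic_risk(losses, 2.0))

# Translation invariance: adding a constant c to every loss shifts the risk by c.
c = 0.3
shifted = [x + c for x in uneven_policy]
print(cvar(shifted, 0.25) - cvar(uneven_policy, 0.25))
print(entropic_risk(shifted, 2.0) - entropic_risk(uneven_policy, 2.0))
```

The average cannot separate the two policies, while both risk measures assign a higher value to the policy that fails on one stratum. The last two lines check translation invariance numerically, the property the abstract credits with preserving monotonicity.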
