The Distributional Alignment Game framework provides a powerful variational perspective on Answer-Level Fine-Tuning (ALFT). However, standard algorithms for these games estimate logarithmic rewards from small batches, which introduces a systematic bias, by Jensen's inequality, that can destabilize training. In this paper, we systematically resolve this structural estimation bias. First, we generalize the alignment game to arbitrary Bregman divergences and show that, for a family of geometries inducing polynomial rewards, provably exact and unbiased estimators can be constructed using U-statistics. Second, for the canonical KL divergence game, where an exact solution is impossible, we derive a globally robust minimax polynomial estimator that is provably optimal: it achieves the fundamental statistical error limit of $\Theta(1/K^2)$, which we establish via the Ditzian-Totik theorem. Finally, we synthesize these two approaches into a novel Variance-Optimal Augmented Polynomial Optimization Program (AQP) Estimator. We prove that, by systematically reducing variance, our method achieves not only optimal bias but also provably accelerated game convergence, leading to more efficient and stable training with zero online computational overhead.
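To make the bias mechanism concrete, the following is a minimal sketch rather than the paper's construction. It assumes a toy setting that does not appear in the abstract: a batch of $K$ Bernoulli($p$) draws with $S$ successes, and the polynomial target $p^m$ standing in for a polynomial reward. The function names (`plugin_power`, `ustat_power`, `expectation`) are hypothetical. The plug-in estimate $(S/K)^m$ applies a nonlinear function to a sample mean and is therefore biased by Jensen's inequality, while the U-statistic $\frac{S(S-1)\cdots(S-m+1)}{K(K-1)\cdots(K-m+1)}$ is exactly unbiased whenever $K \ge m$. Expectations are computed exactly by summing over the binomial distribution, so no sampling noise is involved.

```python
from math import comb


def binomial_pmf(k: int, s: int, p: float) -> float:
    """Probability of s successes in k independent Bernoulli(p) draws."""
    return comb(k, s) * p**s * (1 - p) ** (k - s)


def plugin_power(s: int, k: int, m: int) -> float:
    """Plug-in estimate of p**m: apply x**m to the sample mean (biased for m >= 2)."""
    return (s / k) ** m


def ustat_power(s: int, k: int, m: int) -> float:
    """U-statistic estimate of p**m: ratio of falling factorials (requires k >= m)."""
    num, den = 1, 1
    for j in range(m):
        num *= s - j
        den *= k - j
    return num / den


def expectation(estimator, k: int, p: float, m: int) -> float:
    """Exact expectation of estimator(S, k, m) with S ~ Binomial(k, p)."""
    return sum(binomial_pmf(k, s, p) * estimator(s, k, m) for s in range(k + 1))


if __name__ == "__main__":
    p, m = 0.3, 2  # assumed toy values
    for k in (2, 4, 16):
        bias_plugin = expectation(plugin_power, k, p, m) - p**m
        bias_ustat = expectation(ustat_power, k, p, m) - p**m
        print(f"K={k:2d}  plug-in bias={bias_plugin:+.6f}  U-stat bias={bias_ustat:+.1e}")
```

For $m = 2$ the plug-in bias equals $p(1-p)/K$, so it shrinks only as the batch grows, whereas the U-statistic bias is zero (up to floating-point rounding) at every batch size. The logarithmic reward is not a polynomial, so no such exact correction exists for it, which is the case the minimax polynomial estimator is meant to address.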