Iterative alignment methods based on purely greedy updates are remarkably effective in practice, yet existing theoretical guarantees of \(O(\log T)\) KL-regularized regret can seem pessimistic relative to this empirical performance. In this paper, we argue that the mismatch arises from the regret criterion itself: KL-regularized regret conflates the statistical cost of learning with the exploratory randomization induced by the softened training policy. To separate these effects, we study the traditional temperature-zero regret criterion, which evaluates only the top-ranked response at inference time. Under this decision-centric notion of performance, we prove that standard greedy online alignment methods, including online RLHF and online DPO, achieve constant (\(O(1)\)) cumulative regret. By isolating the cost of identifying the best response from the stochasticity induced by regularization, our results provide a sharper theoretical explanation for the superb practical efficiency of greedy alignment.
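To make the distinction concrete, the display below sketches one plausible formalization of the two criteria; the notation (reward \(r\), reference policy \(\pi_{\mathrm{ref}}\), regularization strength \(\beta\), learned policies \(\pi_t\)) is illustrative and not necessarily the exact definitions used in our analysis:
\[
\mathrm{Reg}^{\mathrm{KL}}_T \;=\; \sum_{t=1}^{T}\Bigl[J_\beta(\pi^{\star}_\beta)-J_\beta(\pi_t)\Bigr],
\qquad
J_\beta(\pi)\;=\;\mathbb{E}_{y\sim\pi(\cdot\mid x_t)}\bigl[r(x_t,y)\bigr]-\beta\,\mathrm{KL}\bigl(\pi\,\|\,\pi_{\mathrm{ref}}\bigr),
\]
\[
\mathrm{Reg}^{0}_T \;=\; \sum_{t=1}^{T}\Bigl[r\bigl(x_t,y^{\star}_t\bigr)-r\bigl(x_t,\hat y_t\bigr)\Bigr],
\qquad
\hat y_t=\arg\max_{y}\pi_t(y\mid x_t),\quad
y^{\star}_t=\arg\max_{y}r(x_t,y).
\]
The KL-regularized criterion \(\mathrm{Reg}^{\mathrm{KL}}_T\) charges the learner for the randomness of the softened policy itself, whereas the temperature-zero criterion \(\mathrm{Reg}^{0}_T\) evaluates only the top-ranked response \(\hat y_t\) actually served at inference time.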