Online on-policy preference learning algorithms for language model alignment, such as online direct preference optimization (DPO), can significantly outperform their offline counterparts. We provide a theoretical explanation for this phenomenon by analyzing how the sampling policy's coverage evolves over the course of on-policy training. We propose and rigorously justify the \emph{coverage improvement principle}: with sufficient batch size, each update moves the policy into a region around the target where coverage is uniformly better, making subsequent data increasingly informative and enabling rapid convergence. In the contextual bandit setting with Bradley-Terry preferences and a linear softmax policy class, we show that on-policy DPO converges exponentially in the number of iterations whenever the batch size exceeds a generalized coverage threshold. In contrast, any learner restricted to offline samples from the initial policy suffers a slower minimax rate, yielding a sharp separation in total sample complexity. Motivated by this analysis, we further propose a simple hybrid sampler based on a novel \emph{preferential} G-optimal design, which removes the dependence on coverage and guarantees convergence in just two rounds. Finally, we develop principled on-policy schemes for reward distillation in the general function class setting, and establish faster noiseless rates under an alternative deviation-based notion of coverage. Experimentally, we confirm that on-policy DPO and our proposed reward distillation algorithms outperform their off-policy counterparts and enjoy stable, monotonic performance gains across iterations.