Two divergence regimes dominate modern alignment practice. Supervised fine-tuning and many distillation-style objectives implicitly minimize the forward KL divergence KL(q || pi_theta), yielding stable mode-covering updates but often under-exploiting high-reward modes. In contrast, PPO-style online reinforcement learning from human feedback behaves more like minimizing the reverse KL divergence KL(pi_theta || q), enabling mode-seeking improvements but risking mode collapse. Recent anchored methods, such as ADPO, show that performing the projection in anchored coordinates can substantially improve stability, yet they typically commit to a single divergence. We introduce Alpha-Divergence Preference Optimization (APO), an anchored framework that uses the Csiszár alpha-divergence to continuously interpolate between forward and reverse KL behavior within the same anchored geometry. We derive unified gradient dynamics parameterized by alpha, analyze their gradient-variance properties, and propose a practical reward-and-confidence-guarded alpha schedule that transitions from coverage to exploitation only when the policy is both improving and confidently calibrated. Experiments with Qwen3-1.7B on the math-level3 dataset demonstrate that APO achieves performance competitive with GRPO and GSPO baselines while maintaining training stability.
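For concreteness, one common (Amari/Csiszár) parameterization of the alpha-divergence that realizes this interpolation is sketched below; the paper's exact convention and normalization may differ from this assumed form.

$$
D_\alpha(\pi_\theta \,\|\, q) \;=\; \frac{1}{\alpha(\alpha-1)}\left(\int \pi_\theta(x)^{\alpha}\, q(x)^{1-\alpha}\, dx \;-\; 1\right),
$$

with the limiting cases

$$
\lim_{\alpha \to 1} D_\alpha(\pi_\theta \,\|\, q) = \mathrm{KL}(\pi_\theta \,\|\, q)
\quad\text{(reverse KL, mode-seeking)},
\qquad
\lim_{\alpha \to 0} D_\alpha(\pi_\theta \,\|\, q) = \mathrm{KL}(q \,\|\, \pi_\theta)
\quad\text{(forward KL, mode-covering)},
$$

so sweeping alpha from 0 toward 1 moves the anchored projection from coverage toward exploitation.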