Direct Preference Optimization (DPO) is a widely used RL-free method for aligning language models from pairwise preferences, but it models preferences over full sequences even though generation is driven by per-token decisions. Existing token-level extensions typically decompose a sequence-level Bradley-Terry objective across timesteps, leaving per-prefix (state-wise) optimality implicit. We study how to recover token-level preference optimality using only standard sequence-level pairwise comparisons. We introduce Token-level Bregman Preference Optimization (TBPO), which posits a token-level Bradley-Terry preference model over next-token actions conditioned on the prefix. From this model we derive a Bregman-divergence density-ratio matching objective that generalizes the logistic/DPO loss while preserving the optimal policy induced by the token-level model and retaining DPO-like simplicity. We present two instantiations: TBPO-Q, which explicitly learns a lightweight state baseline, and TBPO-A, which removes the baseline through advantage normalization. Across instruction-following, helpfulness/harmlessness, and summarization benchmarks, TBPO improves alignment quality and training stability and increases output diversity relative to strong sequence-level and token-level baselines.
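For orientation, the sequence-level logistic/DPO loss that the abstract takes as its starting point can be sketched as below. This is the standard DPO objective only, not the TBPO objective, whose exact form is not stated here. The function names, the list-of-log-probabilities input format, and the default `beta` are illustrative assumptions.

```python
import math


def sequence_logprob(token_logprobs):
    """Sum per-token log-probabilities into one sequence log-probability."""
    return sum(token_logprobs)


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard sequence-level DPO loss for a single preference pair.

    Each argument is a list of per-token log-probabilities (an assumed
    input format) for the chosen or rejected response under the policy
    or the frozen reference model. `beta` is the usual temperature.
    """
    # Log density ratios log pi(y|x) - log pi_ref(y|x) for each response.
    chosen_ratio = sequence_logprob(policy_chosen) - sequence_logprob(ref_chosen)
    rejected_ratio = sequence_logprob(policy_rejected) - sequence_logprob(ref_rejected)

    # Bradley-Terry margin between the chosen and rejected responses.
    margin = beta * (chosen_ratio - rejected_ratio)

    # -log(sigmoid(margin)) == softplus(-margin), written in a stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

Note that the per-token log-probabilities enter only through their sum, so the loss constrains the sequence as a whole rather than each prefix. That is the sequence-level modeling choice the abstract contrasts with a token-level preference model.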