Diffusion Large Language Models (dLLMs) introduce a new paradigm for language generation, which in turn presents new challenges for aligning them with human preferences. In this work, we aim to improve policy optimization for dLLMs by reducing the cost of computing trajectory probabilities, thereby enabling scaled-up offline policy training. We prove that: (i) under reference policy regularization, the probability ratio of the newly unmasked tokens is an unbiased estimate of that of the intermediate diffusion states, and (ii) the probability of the full trajectory can be effectively estimated with a single forward pass on a re-masked final state. By integrating these two trajectory reduction strategies into a policy optimization objective, we propose Trajectory Reduction Policy Optimization (dTRPO). We evaluate dTRPO on 7B dLLMs across instruction-following and reasoning benchmarks. Results show that it substantially improves the core performance of state-of-the-art dLLMs, achieving gains of up to 9.6% on STEM tasks, up to 4.3% on coding tasks, and up to 3.0% on instruction-following tasks. Moreover, dTRPO exhibits strong training efficiency due to its offline, single-forward-pass nature, and achieves improved generation efficiency through high-quality outputs.
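To make the two reduction ideas concrete, the following is a minimal sketch and not the authors' implementation. It assumes a DPO-style pairwise preference loss as the offline objective; the abstract only states that a reference-regularized policy optimization objective is used, so that choice is an assumption. The names `MASK`, `remask`, `log_ratio`, and `preference_loss` are hypothetical, and the per-token log-probabilities stand in for the output of one forward pass of the policy and of the reference model on the re-masked final sequence.

```python
import math
import random

MASK = -1  # hypothetical mask token id, for illustration only


def remask(tokens, mask_ratio, rng):
    """Re-mask a random fraction of a finished sequence.

    Returns the masked copy and the sorted list of masked positions.
    """
    n = len(tokens)
    k = max(1, round(mask_ratio * n))
    positions = sorted(rng.sample(range(n), k))
    masked = list(tokens)
    for p in positions:
        masked[p] = MASK
    return masked, positions


def log_ratio(policy_logps, ref_logps, positions):
    """Sum of log(pi / pi_ref) over the masked positions only.

    In a real setup both inputs would come from a single forward pass of
    each model on the re-masked sequence (assumption: per-token log-probs
    of the original tokens at every position).
    """
    return sum(policy_logps[p] - ref_logps[p] for p in positions)


def preference_loss(lr_chosen, lr_rejected, beta=0.1):
    """DPO-style loss -log(sigmoid(beta * margin)), computed stably.

    Assumption: the pairwise form and beta are illustrative, not taken
    from the source.
    """
    margin = beta * (lr_chosen - lr_rejected)
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    rng = random.Random(0)
    masked, pos = remask([11, 12, 13, 14], mask_ratio=0.5, rng=rng)
    # Placeholder log-probs; a real run would query the two models on `masked`.
    policy = [-1.0, -2.0, -3.0, -0.5]
    ref = [-1.5, -2.0, -2.0, -0.5]
    print(masked, pos, preference_loss(log_ratio(policy, ref, pos), 0.0))
```

Under these assumptions, each preference pair costs one forward pass per model per response, rather than one pass per denoising step, which is the source of the training efficiency the abstract describes.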