Improving the reasoning capabilities of diffusion-based large language models (dLLMs) through reinforcement learning (RL) remains an open problem. The intractability of the dLLM likelihood function necessitates approximating the likelihoods of the current, old, and reference policies at each policy optimization step. This reliance introduces additional computational overhead and can lead to large variance and estimation error in the RL objective, particularly in computing the policy ratio for importance sampling. To mitigate these issues, we introduce wd1, a novel ratio-free policy optimization approach that reformulates the RL objective as a weighted log-likelihood, requiring only a single approximation of the current parametrized policy likelihood. We formally show that the proposed method can be interpreted as energy-guided discrete diffusion training combined with negative-sample unlearning, thereby confirming its theoretical soundness. In experiments on the LLaDA-8B model, wd1 outperforms diffusion-based GRPO (d1) at lower computational cost, achieving up to a $+59\%$ improvement in accuracy. Furthermore, we extend wd1 to denoising-stepwise weighted policy optimization (wd1++), achieving state-of-the-art math performance of $44.2\%$ on MATH500 and $84.5\%$ on GSM8K with only 20 RL training steps.
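As a rough illustration of the contrast the abstract draws, consider a simplified, unclipped GRPO-style ratio objective against a weighted log-likelihood objective; this is a schematic sketch only, and the weight $w$ below stands in for whatever reward-derived weighting wd1 actually uses, which is defined in the paper body rather than here:

$$
\mathcal{L}_{\text{ratio}}(\theta) \;=\; \mathbb{E}\!\left[\frac{\pi_\theta(o \mid q)}{\pi_{\theta_{\text{old}}}(o \mid q)}\, A(o)\right] \;-\; \beta\, \mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right),
\qquad
\mathcal{L}_{\text{wd1}}(\theta) \;=\; \mathbb{E}\!\left[w(o)\, \log \pi_\theta(o \mid q)\right].
$$

Because the dLLM likelihood $\pi(o \mid q)$ is intractable and must itself be estimated, the ratio-based form compounds three such approximations per optimization step (current, old, and reference policies), whereas the weighted log-likelihood form requires only the single approximation of $\log \pi_\theta(o \mid q)$.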