Diffusion large language models (dLLMs) offer a promising route to parallel and efficient text generation, but improving their reasoning ability requires effective post-training. Reinforcement learning with verifiable rewards (RLVR) is a natural choice for this purpose, yet its application to dLLMs is hindered by the absence of tractable sequence-level log-ratios, which are central to standard policy optimization. This forces existing methods to rely on high-variance ELBO-based approximations, in which high verifier rewards can amplify inaccurate score estimates and destabilize RL training. To overcome this issue, we propose \textbf{R}elative \textbf{S}core \textbf{P}olicy \textbf{O}ptimization (RSPO), a simple RLVR method that uses verifiable rewards to calibrate noisy likelihood estimates in dLLMs. RSPO rests on a key observation: a reward advantage can be interpreted not only as an update direction, but also as a target for the relative log-ratio between the current and reference policies. Accordingly, RSPO calibrates the noisy relative log-ratio estimate against the target implied by the reward advantage, updating the policy according to the gap between the current estimate and that target rather than the raw advantage alone. Experiments on mathematical reasoning and planning benchmarks show that RSPO yields especially strong gains on planning tasks and remains competitive on mathematical reasoning.
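The abstract does not state the RSPO objective, so the following is only a minimal sketch of one way to read "update according to the gap between the current estimate and the target". It assumes a group-mean baseline for the advantage, a target of `advantage / beta`, and a squared-error loss; the names `group_advantages`, `gap_weights`, `calibration_loss`, and the scale `beta` are hypothetical and are not taken from the paper.

```python
from typing import List


def group_advantages(rewards: List[float]) -> List[float]:
    """Center verifier rewards within one group of sampled completions.

    Assumption: a plain group-mean baseline; the paper may normalize differently.
    """
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def gap_weights(rel_scores: List[float], advantages: List[float],
                beta: float = 1.0) -> List[float]:
    """Per-sample gap: reward-implied target minus current noisy estimate.

    rel_scores: noisy estimates of the relative log-ratio between the current
                and reference policies (e.g. from an ELBO-style approximation).
    beta:       hypothetical scale that maps an advantage to a target log-ratio.
    """
    return [a / beta - s for s, a in zip(rel_scores, advantages)]


def calibration_loss(rel_scores: List[float], advantages: List[float],
                     beta: float = 1.0) -> float:
    """Half mean-squared gap (assumed loss form).

    Its gradient with respect to each score is -gap / n, so a descent step
    moves every estimate toward its target in proportion to the gap, not in
    proportion to the raw advantage.
    """
    gaps = gap_weights(rel_scores, advantages, beta)
    return 0.5 * sum(g * g for g in gaps) / len(gaps)


if __name__ == "__main__":
    rewards = [1.0, 0.0, 1.0, 0.0]        # verifier outcomes for four samples
    rel_scores = [0.9, -0.1, 0.0, -0.5]   # made-up noisy log-ratio estimates
    adv = group_advantages(rewards)
    print("advantages:", adv)
    print("gaps:", gap_weights(rel_scores, adv))
    print("loss:", calibration_loss(rel_scores, adv))
```

In this toy input the first sample has a positive advantage but an estimate that already overshoots its target, so its gap is negative and the update would pull the estimate back down. An update weighted by the raw advantage would push it further up, which is the amplification effect the abstract attributes to existing methods.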