Masked diffusion language models generate by iteratively filling masked tokens over multiple denoising steps, so learning only from a terminal reward on the final completion yields coarse credit assignment over intermediate decisions. We propose DiSPO (Diffusion-State Policy Optimization), a plug-in credit-assignment layer that directly optimizes intermediate filling decisions. At selected intermediate masked states, DiSPO branches by resampling fillings for the currently masked positions from rollout-cached logits, scores the resulting completions, and updates only the newly filled tokens -- without additional multi-step diffusion rollouts. We formalize a fixed-state objective for branched completions and derive a policy-gradient estimator that can be combined with terminal-feedback policy optimization using the same rollouts. On LLaDA-8B-Instruct, DiSPO consistently improves over the terminal-feedback diffu-GRPO baseline on math and planning benchmarks under matched rollout compute and optimizer steps. Our code will be available at https://daioba.github.io/dispo .
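The branching step described above (resample fillings for the currently masked positions from rollout-cached logits, score each branched completion, and credit only the newly filled tokens) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function name, the toy reward, and the group-mean baseline are assumptions for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)

MASK = -1  # sentinel for a masked position (illustrative)
V = 8      # toy vocabulary size

def branch_and_score(state, cached_logits, reward_fn, n_branches=4):
    """At a fixed intermediate masked state, resample fillings for the
    masked positions from rollout-cached logits, score each branched
    completion, and return per-branch (positions, fill) pairs together
    with group-relative advantages. A training update would apply these
    advantages only to the log-probabilities of the newly filled tokens."""
    masked = np.where(np.array(state) == MASK)[0]
    # Softmax over the cached logits at the masked positions.
    z = cached_logits[masked]
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(z)
    probs /= probs.sum(axis=1, keepdims=True)
    branches, rewards = [], []
    for _ in range(n_branches):
        fill = np.array([rng.choice(V, p=p) for p in probs])
        completion = np.array(state)
        completion[masked] = fill
        branches.append((masked, fill))
        rewards.append(reward_fn(completion))
    rewards = np.array(rewards, dtype=float)
    adv = rewards - rewards.mean()  # group-relative baseline, GRPO-style
    return branches, adv

# Toy usage: two masked positions, a random cached-logit table,
# and a placeholder reward (sum of token ids).
state = [3, MASK, 5, MASK]
cached_logits = rng.normal(size=(len(state), V))
branches, adv = branch_and_score(state, cached_logits,
                                 reward_fn=lambda c: float(c.sum()))
```

Because the fillings are drawn from logits cached during the original rollout, no additional multi-step diffusion passes are needed to generate the branches.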