Diffusion language models generate without a fixed left-to-right order, leaving token ordering as a central algorithmic choice. Existing systems mainly use random masking or confidence-driven ordering, which respectively suffer from train-test mismatch and myopic exploration. We introduce DPRM (Doob h-transform Process Reward Model), a plug-in token-ordering module that keeps the host architecture, denoising objective, and supervision unchanged, and modifies only the ordering policy. DPRM starts from confidence-driven ordering and gradually shifts to process-reward-guided ordering through online estimates. We characterize the exact DPRM policy as a reward-tilted Gibbs reveal law, prove convergence of its stagewise Soft-BoN approximation, show that the online bucketized controller tracks the exact DPRM score at empirical-Bernstein rates, and establish a sample-complexity advantage under tractable optimization assumptions. Across nine hosts covering language reasoning, test-time scaling, protein, single-cell, molecular, DNA, text-to-image generation, and VQA, DPRM order variants improve several language, DNA, and multimodal settings while also identifying boundary cases where confidence-only ordering or task-specific utilities are preferable. Code is available at: https://github.com/DakeBU/DPRM-DLLM