Masked Diffusion Language Models (MDLMs) generate text by iteratively filling masked tokens, requiring two coupled decisions at each step: which positions to unmask (where-to-unmask) and which tokens to place (what-to-unmask). While standard MDLM training directly optimizes token prediction (what-to-unmask), inference-time unmasking orders (where-to-unmask) are typically determined by heuristic confidence measures or learned through reinforcement learning with costly on-policy rollouts. To address this, we introduce Gt-Margin, a position-wise score derived from ground-truth tokens, defined as the probability margin between the correct token and its strongest alternative. Gt-Margin yields an oracle unmasking order that prioritizes easier positions first under each partially masked state. We demonstrate that leveraging this oracle unmasking order significantly enhances final generation quality, particularly on logical reasoning benchmarks. Building on this insight, we train a supervised unmasking planner via learning-to-rank to imitate the oracle ordering from masked contexts. The resulting planner integrates into standard MDLM sampling to select where-to-unmask, improving reasoning accuracy without modifying the token prediction model.
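The Gt-Margin score and the oracle ordering it induces can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names and array shapes are assumptions, and in practice margins would be recomputed from the model's predictive distribution under each partially masked state rather than once up front.

```python
import numpy as np

def gt_margin(probs, gt_tokens):
    """Per-position Gt-Margin: probability of the ground-truth token
    minus the probability of its strongest alternative.

    probs:     (L, V) array of softmax outputs at L masked positions.
    gt_tokens: (L,) array of ground-truth token ids.
    """
    L = probs.shape[0]
    p_gt = probs[np.arange(L), gt_tokens]
    # Mask out the ground-truth entry so the max picks the best competitor.
    competitors = probs.copy()
    competitors[np.arange(L), gt_tokens] = -np.inf
    p_alt = competitors.max(axis=1)
    return p_gt - p_alt

def oracle_order(probs, gt_tokens):
    """Oracle unmasking order: easiest (largest-margin) positions first."""
    return np.argsort(-gt_margin(probs, gt_tokens))

# Toy example: 3 masked positions over a 3-token vocabulary.
probs = np.array([[0.90, 0.05, 0.05],   # gt token 0, large margin
                  [0.40, 0.35, 0.25],   # gt token 0, narrow margin
                  [0.10, 0.80, 0.10]])  # gt token 1, clear margin
gt = np.array([0, 0, 1])
print(oracle_order(probs, gt))  # position 0 first, position 1 last
```

A positive margin means the model already favors the correct token at that position; sorting by descending margin realizes the easier-first schedule described above.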