Improving Sampling for Masked Diffusion Models via Information Gain

Masked Diffusion Models (MDMs) offer greater flexibility in decoding order than autoregressive models but require careful planning to achieve high-quality generation. Existing samplers typically adopt greedy heuristics, decoding at each step the positions with the highest local certainty. Through failure case analysis, we identify a fundamental limitation of this approach: it neglects the downstream impact of current decoding choices on subsequent steps and fails to minimize cumulative uncertainty. In particular, these methods do not fully exploit the non-causal nature of MDMs, which makes it possible to evaluate how a decoding decision reshapes token probabilities and uncertainty across all remaining masked positions. To bridge this gap, we propose the Info-Gain Sampler, a principled decoding framework that balances immediate uncertainty with information gain over future masked tokens. Extensive evaluations across diverse architectures and tasks (reasoning, coding, creative writing, and image generation) demonstrate that the Info-Gain Sampler consistently outperforms existing samplers for MDMs. For instance, it achieves a 3.6% improvement in average accuracy on reasoning tasks and a 63.1% win rate in creative writing. Notably, on reasoning tasks it reduces cumulative uncertainty from 78.4 to 48.6, outperforming the best baseline by a large margin. The code will be available at https://github.com/yks23/Information-Gain-Sampler.
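The contrast between a greedy sampler and one that also weighs information gain can be illustrated with a toy example. The sketch below is not the authors' implementation: the abstract does not give the scoring rule, so the score used here (entropy reduction across the other masked positions, times a hypothetical `weight`, minus the candidate's own entropy) is an assumption. The names `JOINT`, `toy_model`, `greedy_step`, `info_gain_step`, and `decode` are likewise invented for illustration, and `toy_model` stands in for an MDM by returning exact conditional marginals of a small hand-written joint distribution.

```python
import math

MASK = None

# Hypothetical toy joint over 4-token sequences: positions 0-2 always agree
# (tightly coupled), position 3 is independent and individually more certain.
JOINT = {
    ("a", "a", "a", "c"): 0.54,
    ("a", "a", "a", "d"): 0.06,
    ("b", "b", "b", "c"): 0.36,
    ("b", "b", "b", "d"): 0.04,
}

def entropy(dist):
    """Shannon entropy (nats) of a {token: probability} dict."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def toy_model(seq, joint):
    """Stand-in for an MDM: exact marginals for every masked position,
    conditioned on the tokens already decoded in seq."""
    consistent = {s: p for s, p in joint.items()
                  if all(t is MASK or t == s[i] for i, t in enumerate(seq))}
    z = sum(consistent.values())
    preds = {}
    for i, t in enumerate(seq):
        if t is MASK:
            dist = {}
            for s, p in consistent.items():
                dist[s[i]] = dist.get(s[i], 0.0) + p / z
            preds[i] = dist
    return preds

def greedy_step(seq, joint):
    """Baseline: decode the masked position with the lowest entropy."""
    preds = toy_model(seq, joint)
    i = min(preds, key=lambda j: entropy(preds[j]))
    new_seq = list(seq)
    new_seq[i] = max(preds[i], key=preds[i].get)
    return new_seq, i

def info_gain_step(seq, joint, weight=1.0):
    """Assumed rule: score = weight * (entropy removed from the other masked
    positions by a trial decode) - (entropy of the candidate itself)."""
    preds = toy_model(seq, joint)
    best = None
    for i, dist in preds.items():
        token = max(dist, key=dist.get)
        before = sum(entropy(d) for j, d in preds.items() if j != i)
        trial = list(seq)
        trial[i] = token
        # Non-causal lookahead: re-predict all remaining masked positions.
        after = sum(entropy(d) for d in toy_model(trial, joint).values())
        score = weight * (before - after) - entropy(dist)
        if best is None or score > best[0]:
            best = (score, i, trial)
    return best[2], best[1]

def decode(seq, joint, step):
    """Apply a step function until no masked position remains."""
    order = []
    while any(t is MASK for t in seq):
        seq, i = step(seq, joint)
        order.append(i)
    return seq, order
```

Starting from a fully masked sequence, `greedy_step` decodes position 3 first because it has the lowest entropy on its own, although doing so tells the model nothing about the other positions. `info_gain_step` instead starts with one of the coupled positions 0 to 2, since a single decode there resolves the other two. Note that each step of this sketch calls the model once per candidate position, so it is more expensive than the greedy baseline.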

