Masked diffusion language models can reduce inference steps by revealing multiple tokens per denoising iteration, but this parallelism is fragile: positions that are individually confident may be unsafe to commit together when their predictions are coupled. Existing training-free samplers such as Top-\(k\), Fast-dLLM, and EB-Sampler mainly control how many tokens to reveal, while often ranking candidates by token-wise scores that ignore interactions within the selected set. We propose ADAS, a training-free reranking rule for parallel masked diffusion decoding. ADAS leaves the base sampler's stopping rule unchanged and modifies only subset construction: it greedily discounts a candidate when it attends strongly to already selected positions whose predictions remain uncertain. Unlike graph-constrained methods that turn attention into hard compatibility constraints, ADAS keeps attention continuous and uses it as a soft marginal penalty. Across LLaDA-8B-Base and Dream-7B-Base on GSM8K, MATH500, HumanEval, and MBPP, plugging ADAS into Top-\(k\), Fast-dLLM, and EB-Sampler improves low-NFE performance at matched denoiser evaluations by \(9.11\) and \(10.46\) percentage points on average, respectively, with \(3.1\%\) per-forward runtime overhead. These results show that soft attention-discounted reranking is a simple and modular way to improve quality in highly parallel decoding for masked diffusion language models.
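The subset-construction idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact scoring rule, so everything concrete here is an assumption made for illustration: the function name `adas_style_select`, the use of `1 - confidence` as the uncertainty of a selected position, the additive form of the penalty, and the `penalty_weight` parameter. The number of positions to reveal (`budget`) is taken as given, standing in for whatever the base sampler's stopping rule decides.

```python
from typing import List, Sequence


def adas_style_select(
    confidence: Sequence[float],
    attention: Sequence[Sequence[float]],
    budget: int,
    penalty_weight: float = 1.0,
) -> List[int]:
    """Greedily pick up to `budget` masked positions to reveal in one step.

    confidence[i]   : token-wise confidence of candidate i (assumed in [0, 1]).
    attention[i][j] : how strongly candidate i attends to candidate j.
    budget          : number of positions to reveal, set by the base sampler.
    penalty_weight  : hypothetical strength of the soft attention penalty.
    """
    n = len(confidence)
    budget = min(budget, n)
    selected: List[int] = []
    penalty = [0.0] * n  # running soft penalty per candidate
    remaining = set(range(n))

    while len(selected) < budget:
        # Discounted score; ties are broken in favour of the lower index.
        best = max(
            remaining,
            key=lambda i: (confidence[i] - penalty_weight * penalty[i], -i),
        )
        selected.append(best)
        remaining.remove(best)

        # Assumed uncertainty of the position that was just selected.
        uncertainty = 1.0 - confidence[best]
        # Candidates that attend strongly to an uncertain selected
        # position are discounted in later greedy rounds.
        for i in remaining:
            penalty[i] += attention[i][best] * uncertainty

    return selected


# Toy example: candidate 1 is the second most confident position but
# attends strongly to candidate 0, whose prediction is still uncertain.
toy_confidence = [0.70, 0.68, 0.60]
toy_attention = [
    [0.0, 0.1, 0.1],
    [0.9, 0.0, 0.1],
    [0.1, 0.1, 0.0],
]

print(adas_style_select(toy_confidence, toy_attention, budget=2))  # [0, 2]
```

With `penalty_weight=0.0` the sketch falls back to ranking by token-wise confidence alone and returns `[0, 1]` on the same toy input. Because attention enters only as a continuous discount, a strongly coupled candidate is never excluded outright; it can still be chosen if its confidence is high enough.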