Masked diffusion models have shown promising performance in generating high-quality samples in a wide range of domains, but accelerating their sampling process remains relatively underexplored. To investigate efficient samplers for masked diffusion, this paper theoretically analyzes the MaskGIT sampler for image modeling, revealing its implicit temperature sampling mechanism. Through this analysis, we introduce the "moment sampler," an asymptotically equivalent but more tractable and interpretable alternative to MaskGIT, which employs a "choose-then-sample" approach: it selects the positions to unmask before sampling tokens at those positions. In addition, we improve the efficiency of choose-then-sample algorithms through two key innovations: a partial caching technique for transformers that approximates longer sampling trajectories without a proportional increase in computational cost, and a hybrid approach that formalizes the exploration-exploitation trade-off in adaptive unmasking. Experiments in image and text domains support our theory and demonstrate the efficiency of our proposed methods, advancing both the theoretical understanding and the practical implementation of masked diffusion samplers.
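To make the "choose-then-sample" ordering concrete, the following is a minimal sketch of one generic unmasking step, not the moment sampler itself. The `MASK` sentinel, the function `choose_then_sample_step`, the confidence score (log of the top probability), and the `explore` knob with Gumbel noise are all illustrative assumptions; the abstract does not specify any of them.

```python
import math
import random

MASK = -1  # hypothetical sentinel marking a masked position


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def choose_then_sample_step(tokens, logits, k, rng, explore=0.0):
    """Unmask up to k masked positions: choose positions first, then sample.

    tokens:  list of ints, where MASK marks a masked position
    logits:  one list of vocabulary logits per position (assumed to come
             from some denoising network; not modeled here)
    explore: assumed knob; 0.0 picks the most confident positions, larger
             values add Gumbel noise so the choice becomes more random
    """
    masked = [i for i, t in enumerate(tokens) if t == MASK]
    probs = {i: softmax(logits[i]) for i in masked}

    # Step 1 (choose): score every masked position, keep the top k.
    scores = {}
    for i in masked:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        scores[i] = math.log(max(probs[i])) + explore * gumbel
    chosen = sorted(masked, key=scores.get, reverse=True)[:k]

    # Step 2 (sample): draw a token at each chosen position from its
    # categorical distribution; unchosen positions stay masked.
    out = list(tokens)
    for i in chosen:
        vocab = range(len(probs[i]))
        out[i] = rng.choices(vocab, weights=probs[i])[0]
    return out, chosen


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = [MASK, 5, MASK, MASK]
    logits = [[10.0, 0.0, 0.0], [0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0], [0.0, 8.0, 0.0]]
    print(choose_then_sample_step(tokens, logits, k=2, rng=rng))
```

In this sketch the position choice depends only on the confidence scores, and token sampling happens afterwards at the chosen positions. Samplers that draw tokens everywhere first and then decide which ones to keep reverse that order.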