Masked diffusion models have emerged as a powerful framework for text and multimodal generation. However, their sampling procedure updates multiple tokens simultaneously and treats generated tokens as immutable, which can lead to error accumulation because early mistakes cannot be revised. In this work, we revisit existing self-correction methods and identify limitations stemming from additional training requirements or reliance on misaligned likelihood estimates. We propose a training-free self-correction framework that exploits the inductive biases of pre-trained masked diffusion models. Without modifying model parameters or introducing auxiliary evaluators, our method significantly improves generation quality on text-to-image generation and multimodal understanding tasks while using fewer sampling steps. Moreover, the proposed framework generalizes across different masked diffusion architectures, highlighting its robustness and practical applicability. Code is available at https://github.com/huge123/FreeCorrection.