One-step generators distilled from Masked Diffusion Models (MDMs) compress multiple sampling steps into a single forward pass, enabling efficient text and image synthesis. However, they suffer from two key limitations: they inherit modeling bias from the teacher, and their discrete token outputs block gradient flow, which prevents post-distillation refinements such as adversarial training, reward-based fine-tuning, and Test-Time Embedding Optimization (TTEO). In this work, we introduce soft embeddings, a simple relaxation that replaces each discrete token with the expected embedding under the generator's output distribution. Soft embeddings preserve the representation fidelity of one-step discrete generators while providing a fully differentiable, continuous surrogate that is compatible with teacher backbones and tokenizer decoders. Integrating soft embeddings into the Di[M]O distillation framework (denoted Soft-Di[M]O) makes one-step generators end-to-end trainable and enables straightforward application of GAN-based refinement, differentiable reward fine-tuning, and TTEO. Empirically, across multiple MDM teachers (e.g., MaskBit, MaskGen), Soft-Di[M]O achieves state-of-the-art one-step results: improved class-to-image performance, including a one-step FID of 1.56 on ImageNet-256 with GAN-based refinement; higher GenEval and HPS scores on text-to-image generation with reward fine-tuning; and further gains from TTEO.
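To make the relaxation concrete, the following is a minimal sketch of what "the expected embedding under the generator's output distribution" means for a single token position. It is an illustration under stated assumptions, not the Soft-Di[M]O implementation: the names `codebook`, `soft_embedding`, and `hard_embedding` are hypothetical, the codebook is a toy 3-entry table, and plain Python lists stand in for tensors. In an autodiff framework the same weighted sum would be differentiable with respect to the logits, whereas the argmax lookup would not.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_embedding(logits, codebook):
    """Expected embedding: sum_k p_k * e_k, with p = softmax(logits).

    `codebook` is a hypothetical embedding table, one row per token id.
    The result is a smooth function of the logits.
    """
    probs = softmax(logits)
    dim = len(codebook[0])
    return [
        sum(p * row[d] for p, row in zip(probs, codebook))
        for d in range(dim)
    ]


def hard_embedding(logits, codebook):
    """Discrete baseline: look up the embedding of the argmax token.

    The argmax is piecewise constant in the logits, so no useful
    gradient passes through this lookup.
    """
    idx = max(range(len(logits)), key=lambda i: logits[i])
    return list(codebook[idx])


# Toy example (assumed values): 3 tokens, 2-dimensional embeddings.
codebook = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
logits = [2.0, 0.5, 0.1]

soft = soft_embedding(logits, codebook)  # blend of all three rows
hard = hard_embedding(logits, codebook)  # exactly the first row
```

When the output distribution is sharply peaked, the soft embedding approaches the hard lookup, which is one intuition for why the relaxation can stay close to the discrete generator's representation while remaining continuous.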