Diffusion Transformer (DiT) has emerged as the new trend of generative diffusion models for image generation. Given the extremely slow convergence of typical DiT training, recent breakthroughs have been driven by mask strategies that significantly improve the training efficiency of DiT through additional intra-image contextual learning. Despite this progress, the mask strategy still suffers from two inherent limitations: (a) a training-inference discrepancy and (b) a fuzzy relation between mask reconstruction and the generative diffusion process, resulting in sub-optimal DiT training. In this work, we address these limitations by unleashing self-supervised discrimination knowledge to boost DiT training. Technically, we frame our DiT in a teacher-student manner. The teacher-student discriminative pairs are built on diffusion noises along the same Probability Flow Ordinary Differential Equation (PF-ODE). Instead of applying a mask reconstruction loss over both the DiT encoder and decoder, we decouple the DiT encoder and decoder to separately tackle the discriminative and generative objectives. In particular, by encoding the discriminative pairs with the student and teacher DiT encoders, a new discriminative loss is designed to encourage inter-image alignment in the self-supervised embedding space. After that, the student samples are fed into the student DiT decoder to perform the typical generative diffusion task. Extensive experiments on the ImageNet dataset show that our method achieves a competitive balance between training cost and generative capacity.
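The abstract only outlines the training scheme. A minimal numpy sketch of one training step might look as follows, with toy stand-ins for every component: single-layer encoder/decoder in place of the DiT, a linear interpolation between data and noise as a stand-in for points along one PF-ODE trajectory, a negative-cosine discriminative loss, an EMA-updated teacher, and an assumed weighting of 0.5 between the two objectives. None of these specific choices (names, loss forms, constants) are from the paper; they are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # toy latent dimension (assumption)

def encode(W, x):
    # Toy stand-in for a DiT encoder: one tanh layer.
    return np.tanh(x @ W)

def decode(W, h):
    # Toy stand-in for a DiT decoder: one linear layer.
    return h @ W

# Student weights; the teacher encoder starts as a copy of the student.
enc_student = rng.normal(size=(D, D)) * 0.1
enc_teacher = enc_student.copy()
dec_student = rng.normal(size=(D, D)) * 0.1
ema_decay = 0.999  # assumed EMA rate

x0 = rng.normal(size=(4, D))   # clean latents (batch of 4)
eps = rng.normal(size=(4, D))  # shared Gaussian noise
t1, t2 = 0.3, 0.6              # two timesteps on the same trajectory

# Two noise levels of the *same* sample form a teacher-student
# discriminative pair (linear path as a stand-in for the PF-ODE).
x_t1 = (1 - t1) * x0 + t1 * eps
x_t2 = (1 - t2) * x0 + t2 * eps

def cosine(a, b):
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return (a * b).sum(-1)

# Discriminative objective on the encoders: align the student embedding
# of x_t1 with the (gradient-free) teacher embedding of x_t2.
z_s = encode(enc_student, x_t1)
z_t = encode(enc_teacher, x_t2)
loss_disc = (1.0 - cosine(z_s, z_t)).mean()

# Generative objective on the student decoder: standard noise-prediction MSE.
eps_hat = decode(dec_student, z_s)
loss_gen = ((eps_hat - eps) ** 2).mean()

loss = loss_gen + 0.5 * loss_disc  # 0.5 is an assumed loss weight

# Teacher tracks the student via an exponential moving average.
enc_teacher = ema_decay * enc_teacher + (1 - ema_decay) * enc_student
```

The key structural point the sketch mirrors is the decoupling: the discriminative loss touches only the encoders, while the generative diffusion loss is computed through the student decoder on top of the student encoding.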