We propose an efficient approach to train large diffusion models with masked transformers. While masked transformers have been extensively studied for representation learning, their application to generative learning in the vision domain remains less explored. Our work is the first to exploit masked training to significantly reduce the training cost of diffusion models. Specifically, we randomly mask out a high proportion (\emph{e.g.}, 50\%) of patches in diffused input images during training. For masked training, we introduce an asymmetric encoder-decoder architecture consisting of a transformer encoder that operates only on unmasked patches and a lightweight transformer decoder that operates on the full set of patches. To promote a long-range understanding of the full set of patches, we add an auxiliary task of reconstructing masked patches to the denoising score matching objective, which learns the score of unmasked patches. Experiments on ImageNet-256$\times$256 show that our approach matches the performance of the state-of-the-art Diffusion Transformer (DiT) model while using only 31\% of its original training time. Thus, our method allows for efficient training of diffusion models without sacrificing generative performance.
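The masking step and the combined objective described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `random_patch_mask` and `masked_training_loss`, the per-patch mean squared error, and the weight `recon_weight` are assumptions introduced here for illustration, and the encoder and decoder themselves are left out. The sketch only shows how patch indices might be split at a 50\% ratio and how a denoising term on unmasked patches could be combined with a reconstruction term on masked patches.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.5, rng=None):
    """Split patch indices uniformly at random into (kept, masked) lists.

    Hypothetical helper: in the described architecture the encoder would
    see only the `kept` patches, while the decoder would see all patches.
    """
    rng = rng or random.Random()
    num_masked = int(num_patches * mask_ratio)
    perm = list(range(num_patches))
    rng.shuffle(perm)
    masked = sorted(perm[:num_masked])
    kept = sorted(perm[num_masked:])
    return kept, masked


def sq_err(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def masked_training_loss(pred, dsm_target, recon_target, kept, masked,
                         recon_weight=0.1):
    """Denoising term on unmasked patches plus a weighted auxiliary
    reconstruction term on masked patches.

    `pred`, `dsm_target`, and `recon_target` are lists of per-patch
    vectors. The value of `recon_weight` is an assumption, not a number
    taken from the text.
    """
    dsm = sum(sq_err(pred[i], dsm_target[i]) for i in kept) / max(len(kept), 1)
    rec = sum(sq_err(pred[i], recon_target[i]) for i in masked) / max(len(masked), 1)
    return dsm + recon_weight * rec


if __name__ == "__main__":
    # 256 patches is an illustrative count, not a figure from the text.
    kept, masked = random_patch_mask(256, mask_ratio=0.5, rng=random.Random(0))
    print(len(kept), len(masked))
```

Under this sketch, the encoder processes only `len(kept)` tokens per image, which is where the training-cost saving would come from; the lightweight decoder and the auxiliary term are what keep the masked patches in the objective.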