In this paper, we study the denoising diffusion probabilistic model (DDPM) in wavelet space, rather than pixel space, for visual synthesis. Because the wavelet transform represents an image in both the spatial and frequency domains, we design a novel architecture, SFUNet, to capture the correlations in both domains. Specifically, starting from the standard denoising U-Net for pixel data, we supplement its 2D convolutions and spatial-only attention layers with our spatial frequency-aware convolution and attention modules, which jointly model the complementary spatial and frequency information in wavelet data. The new architecture can be used as a drop-in replacement for the pixel-based network and is compatible with the vanilla DDPM training process. By explicitly modeling the wavelet signals, our model generates higher-quality images than its pixel-based counterpart on the CIFAR-10, FFHQ, LSUN-Bedroom, and LSUN-Church datasets.
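To make "wavelet space" concrete, the sketch below shows one way an image could be mapped to wavelet coefficients and back. It assumes a single-level 2D Haar transform on a single-channel image with even height and width; the abstract does not state which wavelet or how many decomposition levels are used, so these choices, along with the hypothetical helper names `haar_dwt2` and `haar_idwt2`, are illustrative only and are not the authors' implementation.

```python
import numpy as np


def haar_dwt2(image):
    """Single-level 2D Haar transform (illustrative; assumes even H and W).

    Returns an array of shape (4, H/2, W/2): one low-frequency subband
    followed by three detail subbands. Subband naming conventions vary
    between libraries, so the names below are only descriptive.
    """
    x = np.asarray(image, dtype=np.float64)
    # The four pixels of every non-overlapping 2x2 block.
    a = x[0::2, 0::2]  # top-left
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    low = (a + b + c + d) / 2.0        # local average (low frequency)
    row_diff = (a + b - c - d) / 2.0   # top row minus bottom row
    col_diff = (a - b + c - d) / 2.0   # left column minus right column
    diag_diff = (a - b - c + d) / 2.0  # diagonal difference
    return np.stack([low, row_diff, col_diff, diag_diff], axis=0)


def haar_idwt2(coeffs):
    """Inverse of haar_dwt2: rebuild the (H, W) image from the 4 subbands."""
    low, row_diff, col_diff, diag_diff = coeffs
    h, w = low.shape
    out = np.empty((2 * h, 2 * w), dtype=np.float64)
    out[0::2, 0::2] = (low + row_diff + col_diff + diag_diff) / 2.0
    out[0::2, 1::2] = (low + row_diff - col_diff - diag_diff) / 2.0
    out[1::2, 0::2] = (low - row_diff + col_diff - diag_diff) / 2.0
    out[1::2, 1::2] = (low - row_diff - col_diff + diag_diff) / 2.0
    return out


# Usage: a random 32x32 "image" becomes 4 subbands of size 16x16.
rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32))
coeffs = haar_dwt2(img)
recon = haar_idwt2(coeffs)
```

Under these assumptions, each spatial position in the coefficient array carries four frequency subbands, so the data has a frequency axis in addition to its spatial axes. This is the kind of structure that joint spatial and frequency modeling would operate on. Because the transform is invertible, samples generated in wavelet space can be mapped back to pixels.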