This paper explores image modeling from the frequency space and introduces DCTdiff, an end-to-end diffusion generative paradigm that efficiently models images in the discrete cosine transform (DCT) space. We investigate the design space of DCTdiff and reveal the key design factors. Experiments across different frameworks (UViT, DiT), generation tasks, and various diffusion samplers demonstrate that DCTdiff outperforms pixel-based diffusion models in both generative quality and training efficiency. Remarkably, DCTdiff scales seamlessly to high-resolution generation without resorting to the latent diffusion paradigm. Finally, we illustrate several intriguing properties of DCT image modeling. For example, we provide a theoretical proof of why `image diffusion can be seen as spectral autoregression', bridging the gap between diffusion and autoregressive models. The effectiveness of DCTdiff and the introduced properties suggest a promising direction for image modeling in the frequency space. The code is at \url{https://github.com/forever208/DCTdiff}.
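As background for frequency-space modeling, the sketch below illustrates the core transform the abstract refers to: a 2-D DCT maps an image from pixel space to frequency space losslessly, and energy concentrates in the low-frequency coefficients. This is a minimal illustration using SciPy's `dctn`/`idctn` on a toy gradient image, not the DCTdiff pipeline itself.

```python
import numpy as np
from scipy.fft import dctn, idctn

# Toy "image": an 8x8 grayscale block containing a smooth gradient.
image = np.linspace(0.0, 1.0, 64).reshape(8, 8)

# Forward 2-D DCT (type II, orthonormal): pixel space -> frequency space.
coeffs = dctn(image, norm="ortho")

# For natural (smooth) content, energy concentrates in the low-frequency
# (top-left) coefficients, which is what makes frequency-space modeling compact.
low_freq_energy = np.sum(coeffs[:2, :2] ** 2)
total_energy = np.sum(coeffs ** 2)
print(f"fraction of energy in 2x2 low-freq block: {low_freq_energy / total_energy:.3f}")

# Inverse DCT recovers the image exactly: the transform is invertible,
# so modeling in DCT space loses no information.
reconstructed = idctn(coeffs, norm="ortho")
assert np.allclose(image, reconstructed)
```

For this smooth toy image, the 2x2 low-frequency block already holds the vast majority of the signal energy, while the reconstruction is exact; these two properties (energy compaction plus invertibility) are what make the DCT a natural space for end-to-end image modeling.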