Diffusion models are powerful, but they require a lot of time and data to train. We propose Patch Diffusion, a generic patch-wise training framework that significantly reduces training time cost while improving data efficiency, thereby helping to democratize diffusion model training for a broader range of users. At the core of our innovations is a new conditional score function at the patch level, where the patch location in the original image is included as additional coordinate channels, while the patch size is randomized and diversified throughout training to encode cross-region dependency at multiple scales. Sampling with our method is as easy as in the original diffusion model. With Patch Diffusion, we achieve $\mathbf{\ge 2\times}$ faster training while maintaining comparable or better generation quality. Patch Diffusion also improves the performance of diffusion models trained on relatively small datasets, e.g., when training from scratch on as few as 5,000 images. We achieve FID scores competitive with state-of-the-art benchmarks: 1.77 on CelebA-64$\times$64, 1.93 on AFHQv2-Wild-64$\times$64, and 2.72 on ImageNet-256$\times$256. We share our code and pre-trained models at https://github.com/Zhendong-Wang/Patch-Diffusion.
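To make the patch-level conditioning concrete, the following is a minimal NumPy sketch of how a training patch with coordinate channels could be assembled. It is an illustration under stated assumptions, not the released implementation: the function name `sample_patch_with_coords`, the candidate patch sizes, and the normalization of coordinates to $[-1, 1]$ are all hypothetical choices made for this example.

```python
import numpy as np


def sample_patch_with_coords(image, patch_sizes, rng):
    """Crop a random square patch and append two coordinate channels.

    image: array of shape (C, H, W).
    patch_sizes: candidate patch side lengths (hypothetical choices).
    Returns an array of shape (C + 2, s, s) for the sampled side length s.
    """
    c, h, w = image.shape

    # Randomize the patch size, then pick a random top-left corner
    # such that the patch lies fully inside the image.
    s = int(rng.choice(patch_sizes))
    top = int(rng.integers(0, h - s + 1))
    left = int(rng.integers(0, w - s + 1))

    # Coordinates of every pixel in the full image, normalized to [-1, 1]
    # (an assumed convention for this sketch).
    ys = np.linspace(-1.0, 1.0, h, dtype=image.dtype)
    xs = np.linspace(-1.0, 1.0, w, dtype=image.dtype)
    yy, xx = np.meshgrid(ys, xs, indexing="ij")

    # Crop the image and the coordinate grids with the same window, so the
    # extra channels tell the network where the patch sits in the image.
    patch = image[:, top:top + s, left:left + s]
    coords = np.stack([xx, yy])[:, top:top + s, left:left + s]
    return np.concatenate([patch, coords], axis=0)


rng = np.random.default_rng(0)
image = rng.standard_normal((3, 64, 64)).astype(np.float32)
patch = sample_patch_with_coords(image, patch_sizes=[16, 32, 64], rng=rng)
```

In this sketch the denoising network would receive `patch` (image channels plus two coordinate channels) in place of the full image. Including the full resolution among the candidate sizes means some training steps still see whole images, which is one plausible way to keep sampling identical to that of a standard diffusion model.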