Recent advances in diffusion models have placed them at the forefront of image generation. Despite their superior performance, diffusion models have drawbacks: their architectures are complex and computationally demanding, and their iterative sampling process incurs significant latency. To mitigate these limitations, we introduce a dual approach that combines model miniaturization with a reduction in sampling steps, aimed at significantly decreasing model latency. Our methodology leverages knowledge distillation to streamline the U-Net and image decoder architectures, and introduces an innovative one-step diffusion model (DM) training technique that utilizes feature matching and score distillation. We present two models, SDXS-512 and SDXS-1024, which achieve inference speeds of approximately 100 FPS (30x faster than SD v1.5) and 30 FPS (60x faster than SDXL) on a single GPU, respectively. Moreover, our training approach offers promising applications in image-conditioned control, facilitating efficient image-to-image translation.
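To make the reported speedups concrete, the short sketch below derives the baseline throughput and per-image latency implied by the abstract's figures. These baseline values are not measured; they follow only from dividing the approximate FPS numbers by the stated speedup factors, and the helper name `implied_baseline` is a hypothetical one introduced for illustration.

```python
from typing import Tuple


def implied_baseline(fps: float, speedup: float) -> Tuple[float, float]:
    """Return (baseline_fps, baseline_latency_ms) implied by a reported speedup.

    Assumes the speedup is a plain ratio of throughputs on the same hardware,
    which the abstract suggests but does not spell out.
    """
    baseline_fps = fps / speedup
    baseline_latency_ms = 1000.0 / baseline_fps
    return baseline_fps, baseline_latency_ms


# Approximate figures quoted in the abstract: (model FPS, speedup over baseline).
reported = {
    "SDXS-512 vs SD v1.5": (100.0, 30.0),
    "SDXS-1024 vs SDXL": (30.0, 60.0),
}

for name, (fps, speedup) in reported.items():
    base_fps, base_ms = implied_baseline(fps, speedup)
    model_ms = 1000.0 / fps
    print(f"{name}: model ~{model_ms:.1f} ms/image, "
          f"implied baseline ~{base_fps:.2f} FPS (~{base_ms:.0f} ms/image)")
```

Under these assumptions, the figures imply a baseline of roughly 3.3 FPS (about 300 ms per image) for SD v1.5 and roughly 0.5 FPS (about 2 s per image) for SDXL, which is the latency gap the combined miniaturization and one-step sampling is meant to close.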