Existing text-to-image diffusion models excel at generating high-quality images but face significant efficiency challenges when scaled to high resolutions such as 4K. While previous research has accelerated diffusion models in various ways, it has seldom addressed the inherent redundancy within the latent space. To bridge this gap, this paper introduces DC-Gen, a general framework that accelerates text-to-image diffusion models by leveraging a deeply compressed latent space. Rather than training from scratch, which is costly, DC-Gen uses an efficient post-training pipeline that preserves the quality of the base model. A key challenge in this paradigm is the representation gap between the base model's latent space and a deeply compressed latent space, which can make direct fine-tuning unstable. To overcome this, DC-Gen first bridges the representation gap with a lightweight embedding alignment training stage. Once the latent embeddings are aligned, only a small amount of LoRA fine-tuning is needed to unlock the base model's inherent generation quality. We verify DC-Gen's effectiveness on SANA and FLUX.1-Krea. The resulting DC-Gen-SANA and DC-Gen-FLUX models achieve quality comparable to their base models with a significant speedup. Specifically, DC-Gen-FLUX reduces the latency of 4K image generation by 53x on the NVIDIA H100 GPU. When combined with NVFP4 SVDQuant, DC-Gen-FLUX generates a 4K image in just 3.5 seconds on a single NVIDIA 5090 GPU, achieving a total latency reduction of 138x compared to the base FLUX.1-Krea model. Code: https://github.com/dc-ai-projects/DC-Gen.
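To give a rough sense of why a more deeply compressed latent space helps at 4K, the sketch below counts the tokens a diffusion transformer would process for one image. The abstract does not state the image size in pixels, the autoencoder downsampling factors, or the patch sizes, so the values used here (a 4096x4096 image, an 8x autoencoder with 2x2 patching for the base model, and a 64x autoencoder without patching for the compressed one) are assumptions chosen only for illustration, and `latent_tokens` is a hypothetical helper, not part of the DC-Gen code.

```python
def latent_tokens(height: int, width: int, downsample: int, patch: int = 1) -> int:
    """Count the tokens for one image, given the autoencoder's spatial
    downsampling factor and the transformer's patch size."""
    stride = downsample * patch
    if height % stride or width % stride:
        raise ValueError("image size must be divisible by downsample * patch")
    return (height // stride) * (width // stride)


# Assumed settings (not taken from the abstract).
base_tokens = latent_tokens(4096, 4096, downsample=8, patch=2)
deep_tokens = latent_tokens(4096, 4096, downsample=64, patch=1)

# Fewer tokens means less work per denoising step; self-attention cost
# grows roughly with the square of the token count.
token_ratio = base_tokens / deep_tokens
attention_ratio = token_ratio ** 2

# Back-of-envelope check using only figures from the abstract:
# 3.5 s per image at a 138x total reduction implies this baseline latency.
implied_base_latency_s = 3.5 * 138

print(base_tokens, deep_tokens, token_ratio, attention_ratio, implied_base_latency_s)
```

Under these assumed settings the token count falls by 16x. The 483-second baseline is simply the product of the two reported numbers; the abstract does not specify the hardware on which the base model's latency was measured.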