Visual synthesis has recently seen significant leaps in performance, largely due to breakthroughs in generative models. Diffusion models have been a key enabler, as they excel in image diversity. However, this comes at the cost of slow training and synthesis, which latent diffusion only partially alleviates. Flow matching is an appealing alternative because its characteristics are complementary: training and inference are faster, but synthesis is less diverse. We demonstrate that introducing flow matching between a frozen diffusion model and a convolutional decoder enables high-resolution image synthesis at reduced computational cost and model size. A small diffusion model can then effectively provide the necessary visual diversity, while flow matching efficiently enhances resolution and detail by mapping the small latent space to a high-dimensional one. These latents are then projected to high-resolution images by the subsequent convolutional decoder of the latent diffusion approach. By combining the diversity of diffusion models, the efficiency of flow matching, and the effectiveness of convolutional decoders, we achieve state-of-the-art high-resolution image synthesis at $1024^2$ pixels with minimal computational cost. Scaling our method up further, we reach resolutions of up to $2048^2$ pixels. Importantly, our approach is orthogonal to recent approximation and speed-up strategies for the underlying model, making it easy to integrate into various diffusion model frameworks.
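The mapping from a small latent to a high-dimensional latent can be illustrated with a minimal NumPy sketch of flow matching. It is not the implementation behind the results above; it only shows the general idea under stated assumptions. The abstract does not specify the probability path, so a straight-line (linear interpolation) path is assumed here. All function names, the nearest-neighbour upsampling step, the latent shapes, and the oracle velocity field (which stands in for a trained network) are hypothetical.

```python
import numpy as np


def upsample_nearest(z, factor):
    """Nearest-neighbour upsampling of a (C, H, W) latent.

    Hypothetical stand-in for lifting the small latent to the target size.
    """
    return z.repeat(factor, axis=1).repeat(factor, axis=2)


def interpolate(x0, x1, t):
    """Point on the assumed straight-line path between x0 (t=0) and x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1


def target_velocity(x0, x1):
    """Regression target for the straight-line path: constant in t."""
    return x1 - x0


def flow_matching_loss(v_pred, x0, x1):
    """Mean squared error between a predicted velocity and the target."""
    return float(np.mean((v_pred - target_velocity(x0, x1)) ** 2))


def euler_integrate(velocity_fn, x0, num_steps=8):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x


rng = np.random.default_rng(0)

# Small latent, as a small diffusion model might produce (shape is assumed).
z_low = rng.standard_normal((4, 8, 8))

# Source sample of the flow: the small latent lifted to the target size.
x0 = upsample_nearest(z_low, factor=4)

# Random stand-in for the matching high-resolution latent.
x1 = rng.standard_normal((4, 32, 32))


# Oracle velocity field; in practice a trained network would take its place.
def oracle_velocity(x, t):
    return target_velocity(x0, x1)


x_hat = euler_integrate(oracle_velocity, x0, num_steps=8)
```

With the oracle field the Euler integration recovers `x1` up to floating-point error, because the target velocity of a straight-line path is constant in `t`. A learned field would only approximate this, and the resulting high-resolution latent would then be passed to the convolutional decoder.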