Advances in latent diffusion models (LDMs) have revolutionized high-resolution image generation, but the design space of the autoencoder that is central to these systems remains underexplored. In this paper, we introduce LiteVAE, a family of autoencoders for LDMs that leverage the 2D discrete wavelet transform to enhance scalability and computational efficiency over standard variational autoencoders (VAEs) with no sacrifice in output quality. We also investigate the training methodologies and the decoder architecture of LiteVAE and propose several enhancements that improve the training dynamics and reconstruction quality. Our base LiteVAE model matches the quality of the established VAEs in current LDMs with a six-fold reduction in encoder parameters, leading to faster training and lower GPU memory requirements, while our larger model outperforms VAEs of comparable complexity across all evaluated metrics (rFID, LPIPS, PSNR, and SSIM).
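To make the 2D discrete wavelet transform mentioned above concrete, the following is a minimal sketch of a single-level decomposition and its inverse. It assumes the orthonormal Haar wavelet, chosen here only because it is the simplest case; the abstract does not state which wavelet LiteVAE uses, and the function names `haar_dwt2` and `haar_idwt2` are hypothetical, not taken from the paper.

```python
def haar_dwt2(image):
    """Single-level 2D Haar DWT of a 2D list with even height and width.

    Returns four sub-bands (ll, lh, hl, hh), each half the input size
    along both axes. Assumption: orthonormal Haar filters.
    """
    h, w = len(image), len(image[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    ll, lh, hl, hh = [], [], [], []
    for i in range(0, h, 2):
        r_ll, r_lh, r_hl, r_hh = [], [], [], []
        for j in range(0, w, 2):
            # The four pixels of one non-overlapping 2x2 block.
            a, b = image[i][j], image[i][j + 1]
            c, d = image[i + 1][j], image[i + 1][j + 1]
            r_ll.append((a + b + c + d) / 2)  # scaled block average
            r_lh.append((a + b - c - d) / 2)  # top row minus bottom row
            r_hl.append((a - b + c - d) / 2)  # left column minus right column
            r_hh.append((a - b - c + d) / 2)  # diagonal difference
        ll.append(r_ll)
        lh.append(r_lh)
        hl.append(r_hl)
        hh.append(r_hh)
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Invert haar_dwt2, rebuilding the full-resolution 2D list."""
    h, w = len(ll), len(ll[0])
    image = [[0.0] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            s, v, u, t = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            image[2 * i][2 * j] = (s + v + u + t) / 2
            image[2 * i][2 * j + 1] = (s + v - u - t) / 2
            image[2 * i + 1][2 * j] = (s - v + u - t) / 2
            image[2 * i + 1][2 * j + 1] = (s - v - u + t) / 2
    return image
```

Under these assumptions, each sub-band has half the input resolution along both axes, so a 256×256 channel yields four 128×128 sub-bands, and the transform is exactly invertible. The transform itself is fixed and has no learned parameters.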