We present Deep Compression Autoencoder (DC-AE), a new family of autoencoders for accelerating high-resolution diffusion models. Existing autoencoders have demonstrated impressive results at a moderate spatial compression ratio (e.g., 8x), but fail to maintain satisfactory reconstruction accuracy at high spatial compression ratios (e.g., 64x). We address this challenge with two key techniques: (1) Residual Autoencoding, where we design our models to learn residuals on top of space-to-channel transformed features, alleviating the optimization difficulty of high spatial-compression autoencoders; (2) Decoupled High-Resolution Adaptation, an efficient decoupled three-phase training strategy that mitigates the generalization penalty of high spatial-compression autoencoders. With these designs, we increase the autoencoder's spatial compression ratio up to 128x while maintaining reconstruction quality. Applying DC-AE to latent diffusion models, we achieve significant speedups without any accuracy drop. For example, on ImageNet 512x512, DC-AE provides a 19.1x inference speedup and a 17.9x training speedup for UViT-H on an H100 GPU while achieving a better FID than the widely used SD-VAE-f8 autoencoder. Our code is available at https://github.com/mit-han-lab/efficientvit.
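The space-to-channel shortcut behind Residual Autoencoding can be illustrated with a small NumPy sketch. This is a hypothetical standalone illustration under assumed shapes, not the released implementation: the shortcut reshapes each 2x2 spatial block into channels, then averages channel groups to match the downsampling block's output width, so the learned branch only has to model a residual on top of it.

```python
import numpy as np

def space_to_channel(x, factor=2):
    """Pixel-unshuffle: (C, H, W) -> (C * factor^2, H / factor, W / factor)."""
    c, h, w = x.shape
    x = x.reshape(c, h // factor, factor, w // factor, factor)
    x = x.transpose(0, 2, 4, 1, 3)  # (C, f, f, H/f, W/f)
    return x.reshape(c * factor * factor, h // factor, w // factor)

def channel_average_shortcut(x, out_channels, factor=2):
    """Non-parametric shortcut: space-to-channel, then average groups of
    channels so the shortcut matches the learned branch's output shape."""
    y = space_to_channel(x, factor)
    c, h, w = y.shape
    assert c % out_channels == 0, "channel count must divide evenly"
    return y.reshape(out_channels, c // out_channels, h, w).mean(axis=1)

# A downsampling block would then compute: out = learned_branch(x) + shortcut,
# so the network only needs to learn the residual around this identity-like path.
x = np.arange(3 * 8 * 8, dtype=np.float64).reshape(3, 8, 8)
shortcut = channel_average_shortcut(x, out_channels=4)
print(shortcut.shape)  # (4, 4, 4)
```

The same transform applied in reverse (channel-to-space plus group duplication) gives the decoder's upsampling shortcut, keeping the encoder and decoder symmetric.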