Autoencoders empower state-of-the-art image and video generative models by compressing pixels into a latent space through visual tokenization. Although recent advances have alleviated the performance degradation of autoencoders under high compression ratios, the training instability introduced by GAN-based objectives remains an open challenge. Beyond improving spatial compression, we also aim to minimize the dimensionality of the latent space, enabling more efficient and compact representations. To address both challenges, we focus on improving the decoder's expressiveness. Concretely, we propose DGAE, which employs a diffusion model to guide the decoder in recovering informative signals that are not fully decoded from the latent representation. With this design, DGAE effectively mitigates the performance degradation under high spatial compression rates. At the same time, it achieves state-of-the-art performance with a 2x smaller latent space. When integrated with diffusion models, DGAE delivers competitive image generation performance on ImageNet-1K and shows that its compact latent representation accelerates the convergence of the diffusion model.
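The core idea above can be illustrated with a toy sketch: a linear encoder compresses pixels into a small latent, a linear decoder produces a coarse reconstruction, and a diffusion-style refinement loop, conditioned on the latent, iteratively nudges the coarse output toward a higher-detail signal. All networks, weights, and the `diffusion_refine` helper here are hypothetical stand-ins for the learned components described in the paper, not its actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image": a flattened 8x8 patch; the latent is 4-dim (16x compression).
D, Z = 64, 4

# Hypothetical linear encoder/decoder weights (stand-ins for learned networks).
W_enc = rng.normal(size=(Z, D)) / np.sqrt(D)
W_dec = rng.normal(size=(D, Z)) / np.sqrt(Z)

def encode(x):
    """Compress pixels into a compact latent."""
    return W_enc @ x

def decode(z):
    """Coarse reconstruction from the latent; fine detail is lost."""
    return W_dec @ z

def diffusion_refine(x_coarse, z, steps=4):
    """Toy denoising loop: a 'diffusion model' conditioned on the latent z
    nudges the coarse decode toward a detail-enriched target. The target
    and update rule are illustrative placeholders; DGAE would use a
    learned denoiser here."""
    x = x_coarse.copy()
    target = x_coarse + 0.1 * (W_dec @ z)  # hypothetical recovered detail
    for t in range(steps):
        noise_scale = 0.5 ** (t + 1)       # shrinking noise over steps
        x = x + 0.5 * (target - x) + noise_scale * 0.01 * rng.normal(size=x.shape)
    return x

x = rng.normal(size=D)          # toy input "image"
x_hat = diffusion_refine(decode(encode(x)), encode(x))
print(x_hat.shape)              # same shape as the input pixels
```

The point of the sketch is only the control flow: the decoder alone yields a lossy reconstruction, and the latent-conditioned refinement stage supplies the signal the decoder could not recover on its own.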