Vector Quantized Variational Autoencoders (VQ-VAEs) are fundamental models that compress continuous visual data into discrete tokens. Existing methods have tried to improve the quantization strategy for better reconstruction quality, yet a large gap still remains between VQ-VAEs and continuous VAEs. To narrow this gap, we propose MGVQ, a novel method that augments the representation capability of discrete codebooks, eases codebook optimization, and minimizes information loss, thereby enhancing reconstruction quality. Specifically, we retain the latent dimension to preserve encoded features and incorporate a set of sub-codebooks for quantization. Furthermore, we construct comprehensive zero-shot benchmarks at 512p and 2K resolutions to rigorously evaluate the reconstruction performance of existing methods. MGVQ achieves state-of-the-art performance among VQ-VAEs on both ImageNet and eight zero-shot benchmarks. Notably, it even surpasses the continuous SD-VAE on ImageNet by a significant margin, with an rFID of 0.49 vs. 0.91, and achieves superior PSNR on all zero-shot benchmarks. These results highlight the superiority of MGVQ in reconstruction and pave the way for fidelity-preserving HD image processing tasks. Code will be publicly available at https://github.com/MKJia/MGVQ.
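To make the sub-codebook idea concrete, below is a minimal PyTorch sketch of multi-group quantization: the latent keeps its full channel dimension, is split into groups along the channels, and each group is matched against its own learnable sub-codebook with a straight-through gradient. The `MultiGroupQuantizer` class and all names and shapes are illustrative assumptions for exposition, not the paper's reference implementation.

```python
import torch
import torch.nn as nn

class MultiGroupQuantizer(nn.Module):
    """Illustrative sketch (not the official MGVQ code): split the latent
    channels into G groups and quantize each group with its own sub-codebook
    via nearest-neighbour lookup and a straight-through gradient."""

    def __init__(self, latent_dim: int, num_groups: int, codebook_size: int):
        super().__init__()
        assert latent_dim % num_groups == 0, "latent_dim must split evenly"
        self.num_groups = num_groups
        self.group_dim = latent_dim // num_groups
        # One learnable sub-codebook per channel group.
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, self.group_dim)
            for _ in range(num_groups)
        )

    def forward(self, z: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # z: (B, C, H, W) with C == latent_dim; the full latent dim is kept.
        groups = z.chunk(self.num_groups, dim=1)
        quantized, indices = [], []
        for g, codebook in zip(groups, self.codebooks):
            b, _, h, w = g.shape
            flat = g.permute(0, 2, 3, 1).reshape(-1, self.group_dim)  # (BHW, d)
            # Nearest codeword per spatial position (L2 distance).
            dist = torch.cdist(flat, codebook.weight)                 # (BHW, K)
            idx = dist.argmin(dim=1)                                  # (BHW,)
            q = codebook(idx).view(b, h, w, -1).permute(0, 3, 1, 2)   # (B, d, H, W)
            # Straight-through estimator: copy gradients past the lookup.
            quantized.append(g + (q - g).detach())
            indices.append(idx.view(b, h, w))
        # Tokens: one index per group per position -> (B, G, H, W).
        return torch.cat(quantized, dim=1), torch.stack(indices, dim=1)
```

Under these assumptions, each spatial position is represented by a tuple of G indices, so G sub-codebooks of size K span an effective joint vocabulary of K^G codes without training a single prohibitively large codebook.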