Vector-quantized variational autoencoders (VQ-VAEs) are autoencoders that compress images into discrete tokens. However, they are difficult to train due to the discretization step. In this paper, we propose a simple yet effective technique dubbed Gaussian Quant (GQ), which first trains a Gaussian VAE under certain constraints and then converts it into a VQ-VAE without additional training. For conversion, GQ generates random Gaussian noise as a codebook and assigns each posterior mean to its closest noise vector. Theoretically, we prove that when the logarithm of the codebook size exceeds the bits-back coding rate of the Gaussian VAE, a small quantization error is guaranteed. Practically, we propose a heuristic for training Gaussian VAEs for effective conversion, named the target divergence constraint (TDC). Empirically, we show that GQ outperforms previous VQ-VAEs, such as VQGAN, FSQ, LFQ, and BSQ, on both UNet and ViT architectures. Furthermore, TDC also improves previous Gaussian VAE discretization methods, such as TokenBridge. The source code is provided in the supplementary materials.
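The conversion step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name, latent shapes, and codebook size are all assumptions, and the random Gaussian codebook and nearest-neighbor assignment follow the abstract's description.

```python
import numpy as np

def gaussian_quant(mu, codebook_size=1024, seed=0):
    """Illustrative GQ conversion: draw a fixed random Gaussian codebook
    and snap each posterior mean to its nearest codebook entry.

    mu: array of shape (num_tokens, dim) -- posterior means from a
        pretrained Gaussian VAE (hypothetical shapes for illustration).
    """
    rng = np.random.default_rng(seed)
    dim = mu.shape[1]
    # Random Gaussian noise serves as the codebook; no training is needed.
    codebook = rng.standard_normal((codebook_size, dim))
    # Squared Euclidean distance from every posterior mean to every code.
    d2 = ((mu[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    tokens = d2.argmin(axis=1)      # discrete token indices
    quantized = codebook[tokens]    # quantized latent vectors
    return tokens, quantized

# Example: quantize 8 posterior means of dimension 16.
mu = np.random.default_rng(1).standard_normal((8, 16))
tokens, quantized = gaussian_quant(mu)
```

The codebook is reproducible from the seed alone, so encoder and decoder can share it without transmitting any learned parameters for the quantizer.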