Vector Quantized Diffusion Model for Text-to-Image Synthesis

We present the vector quantized diffusion (VQ-Diffusion) model for text-to-image generation. The method is based on a vector quantized variational autoencoder (VQ-VAE) whose latent space is modeled by a conditional variant of the recently developed Denoising Diffusion Probabilistic Model (DDPM). We find that this latent-space method is well suited to text-to-image generation because it not only eliminates the unidirectional bias of existing methods but also lets us incorporate a mask-and-replace diffusion strategy that avoids the accumulation of errors, a serious problem with existing methods. Our experiments show that VQ-Diffusion produces significantly better text-to-image generation results than conventional autoregressive (AR) models with similar numbers of parameters. Compared with previous GAN-based text-to-image methods, VQ-Diffusion can handle more complex scenes and improves the synthesized image quality by a large margin. Finally, we show that image generation in our method can be made highly efficient by reparameterization. With traditional AR methods, the text-to-image generation time increases linearly with the output image resolution and is therefore time-consuming even for normal-sized images. VQ-Diffusion allows us to achieve a better trade-off between quality and speed. Our experiments indicate that the VQ-Diffusion model with reparameterization is fifteen times faster than traditional AR methods while achieving better image quality.
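The abstract names the mask-and-replace strategy without giving its details. The following is a minimal sketch of one possible forward corruption step on a sequence of VQ-VAE codebook indices, under stated assumptions: each token is independently masked with probability `gamma`, resampled uniformly from the codebook with probability `beta_total`, and kept otherwise, and masked tokens stay masked. The function name, its parameters, and the fixed per-step probabilities are illustrative and are not taken from the text above.

```python
import random


def mask_and_replace_step(tokens, num_codes, mask_id, gamma, beta_total, rng):
    """Apply one hypothetical mask-and-replace corruption step.

    tokens     : list of codebook indices in [0, num_codes), or mask_id
    num_codes  : size of the VQ-VAE codebook (assumed)
    mask_id    : extra index reserved for the [MASK] token (assumed)
    gamma      : probability of replacing a token with mask_id
    beta_total : probability of resampling a token uniformly from the codebook
    rng        : random.Random instance, passed in for reproducibility
    """
    if gamma < 0 or beta_total < 0 or gamma + beta_total > 1:
        raise ValueError("gamma and beta_total must be non-negative and sum to at most 1")

    corrupted = []
    for tok in tokens:
        if tok == mask_id:
            # A masked token is absorbing: it never reverts in the forward process.
            corrupted.append(mask_id)
            continue
        u = rng.random()  # uniform in [0, 1)
        if u < gamma:
            corrupted.append(mask_id)
        elif u < gamma + beta_total:
            # Uniform resampling may return the original index by chance.
            corrupted.append(rng.randrange(num_codes))
        else:
            corrupted.append(tok)
    return corrupted


if __name__ == "__main__":
    rng = random.Random(0)
    x = [rng.randrange(8) for _ in range(16)]  # toy sequence of latent tokens
    for _ in range(10):
        x = mask_and_replace_step(x, num_codes=8, mask_id=8,
                                  gamma=0.1, beta_total=0.05, rng=rng)
    print(x)
```

In this sketch the masked positions tell a denoising network where information is missing, while the randomly replaced positions force it to reconsider unmasked tokens as well. A real model would typically vary `gamma` and `beta_total` with the diffusion timestep rather than hold them fixed.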
