Autoregressive models have shown remarkable success in image generation by adapting sequential prediction techniques from language modeling. However, applying these approaches to images requires discretizing continuous pixel data through vector quantization methods such as VQ-VAE. To reduce the quantization error inherent in VQ-VAE, recent works have adopted larger codebooks. This, however, expands the vocabulary size accordingly, complicating the autoregressive modeling task. This paper seeks to retain the benefits of large codebooks without making autoregressive modeling harder. Through empirical investigation, we discover that tokens with similar codeword representations produce similar effects on the final generated image, revealing significant redundancy in large codebooks. Based on this insight, we propose to predict tokens from coarse to fine (CTF), realized by assigning the same coarse label to similar tokens. Our framework consists of two stages: (1) an autoregressive model that sequentially predicts coarse labels for each token in the sequence, and (2) an auxiliary model that simultaneously predicts fine-grained labels for all tokens conditioned on their coarse labels. Experiments on ImageNet demonstrate our method's superior performance, achieving an average improvement of 59 points in Inception Score over baselines. Notably, despite adding an inference step, our approach achieves faster sampling speeds.
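The abstract does not specify how similar tokens are grouped under a shared coarse label. One plausible realization, sketched below under the assumption that similarity is measured in the codeword embedding space, is to cluster the codebook with k-means and use the cluster index as each token's coarse label (the function name `assign_coarse_labels` and all parameters are illustrative, not from the paper):

```python
import numpy as np

def assign_coarse_labels(codebook, num_coarse, iters=20, seed=0):
    """Map each of the V codewords to one of `num_coarse` coarse labels
    by running a simple k-means over the (V, d) codebook embeddings.
    Hypothetical sketch: the paper's actual grouping rule may differ."""
    rng = np.random.default_rng(seed)
    # Initialize centers from randomly chosen codewords (fancy indexing copies).
    centers = codebook[rng.choice(len(codebook), num_coarse, replace=False)]
    for _ in range(iters):
        # Assign every codeword to its nearest center (squared Euclidean distance).
        d2 = ((codebook[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        # Recompute each center as the mean of its assigned codewords.
        for k in range(num_coarse):
            members = codebook[labels == k]
            if len(members):
                centers[k] = members.mean(axis=0)
    return labels

# Toy example: compress a 1024-entry codebook into 64 coarse labels,
# shrinking the vocabulary the autoregressive model must predict over.
codebook = np.random.default_rng(1).normal(size=(1024, 8))
coarse = assign_coarse_labels(codebook, num_coarse=64)
```

Under this view, stage (1) predicts over the reduced 64-way coarse vocabulary, and stage (2) only needs to resolve the fine-grained identity within each cluster.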