Latent-based image generative models, such as Latent Diffusion Models (LDMs) and Masked Image Models (MIMs), have achieved notable success in image generation tasks. These models typically leverage reconstructive autoencoders, such as VQGAN or VAE, to encode pixels into a more compact latent space and learn the data distribution in that latent space rather than directly from pixels. However, this practice raises a pertinent question: is it truly the optimal choice? In response, we begin with an intriguing observation: despite sharing the same latent space, autoregressive models significantly lag behind LDMs and MIMs in image generation. This finding contrasts sharply with the field of NLP, where the autoregressive model GPT has established a commanding presence. To address this discrepancy, we introduce a unified perspective on the relationship between latent space and generative models, emphasizing the stability of the latent space in image generative modeling. Furthermore, we propose a simple but effective discrete image tokenizer that stabilizes the latent space for image generative modeling by applying K-Means to the latent features of self-supervised learning models. Experimental results show that image autoregressive modeling with our tokenizer (DiGIT) benefits both image understanding and image generation under the next-token-prediction principle, which is inherently straightforward for GPT models but challenging for other generative models. Remarkably, for the first time, a GPT-style autoregressive model for images outperforms LDMs, and, like GPT, it exhibits substantial improvement when scaling up the model size. Our findings underscore the potential of an optimized latent space and the integration of discrete tokenization in advancing the capabilities of image generative models. The code is available at \url{https://github.com/DAMO-NLP-SG/DiGIT}.
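The tokenization idea described above can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: it assumes patch-level features from a self-supervised encoder (stubbed here with random vectors; a real pipeline would use, e.g., DINO patch embeddings), clusters them with K-Means, and maps each patch to the index of its nearest centroid to produce discrete tokens.

```python
# Hypothetical sketch of a K-Means-based discrete image tokenizer.
# The SSL feature extractor is stubbed with random vectors; cluster count
# and feature dimension are illustrative, not the paper's settings.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for patch features from a self-supervised encoder:
# 1024 patches, each a 64-dimensional feature vector.
features = rng.normal(size=(1024, 64))

# Fit K-Means on the feature bank to obtain a discrete codebook of centroids.
kmeans = KMeans(n_clusters=16, n_init=10, random_state=0).fit(features)

def tokenize(patch_features: np.ndarray) -> np.ndarray:
    """Map each patch feature to the index of its nearest centroid."""
    return kmeans.predict(patch_features)

# Tokenize 8 patches into discrete token ids in [0, 16).
tokens = tokenize(features[:8])
print(tokens.shape)
```

The resulting token sequence can then be consumed by a GPT-style model trained with next-token prediction, in place of codes from a reconstructive autoencoder such as VQGAN.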