Latent-based image generative models, such as Latent Diffusion Models (LDMs) and Masked Image Models (MIMs), have achieved notable success in image generation tasks. These models typically leverage reconstructive autoencoders, such as VQGAN or VAE, to encode pixels into a more compact latent space and learn the data distribution in that latent space rather than directly from pixels. However, this practice raises a pertinent question: is it truly the optimal choice? In response, we begin with an intriguing observation: despite sharing the same latent space, autoregressive models significantly lag behind LDMs and MIMs in image generation. This finding contrasts sharply with the field of NLP, where the autoregressive model GPT has established a commanding presence. To address this discrepancy, we introduce a unified perspective on the relationship between latent space and generative models, emphasizing the stability of the latent space in image generative modeling. Furthermore, we propose a simple but effective discrete image tokenizer that stabilizes the latent space for image generative modeling by applying K-Means to the latent features of self-supervised learning models. Experimental results show that image autoregressive modeling with our tokenizer (DiGIT) benefits both image understanding and image generation under the next-token-prediction principle, which is inherently straightforward for GPT models but challenging for other generative models. Remarkably, for the first time, a GPT-style autoregressive model for images outperforms LDMs, and, like GPT, it exhibits substantial improvement when scaling up the model size. Our findings underscore the potential of an optimized latent space and the integration of discrete tokenization in advancing the capabilities of image generative models. The code is available at \url{https://github.com/DAMO-NLP-SG/DiGIT}.
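The tokenization idea described above can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: it assumes patch-level features from a self-supervised encoder (stubbed here with random vectors; a real pipeline would use, e.g., DINO patch embeddings), clusters them with K-Means, and maps each patch to the index of its nearest centroid to produce discrete tokens.

```python
# Hypothetical sketch of a K-Means-based discrete image tokenizer.
# The SSL feature extractor is stubbed with random vectors; cluster count
# and feature dimension are illustrative, not the paper's settings.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for patch features from a self-supervised encoder:
# 1024 patches, each a 64-dimensional feature vector.
features = rng.normal(size=(1024, 64))

# Fit K-Means on the feature bank to obtain a discrete codebook of centroids.
kmeans = KMeans(n_clusters=16, n_init=10, random_state=0).fit(features)

def tokenize(patch_features: np.ndarray) -> np.ndarray:
    """Map each patch feature to the index of its nearest centroid."""
    return kmeans.predict(patch_features)

# Tokenize 8 patches into discrete token ids in [0, 16).
tokens = tokenize(features[:8])
print(tokens.shape)
```

The resulting token sequence can then be consumed by a GPT-style model trained with next-token prediction, in place of codes from a reconstructive autoencoder such as VQGAN.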