RDPM: Solve Diffusion Probabilistic Models via Recurrent Token Prediction

Diffusion Probabilistic Models (DPMs) have emerged as the de facto approach for high-fidelity image synthesis. They run the diffusion process on continuous VAE latents, which differs significantly from the text generation methods employed by Large Language Models (LLMs). In this paper, we introduce a novel generative framework, the Recurrent Diffusion Probabilistic Model (RDPM), which enhances the diffusion process through a recurrent token prediction mechanism, thereby pioneering the field of Discrete Diffusion. By progressively adding Gaussian noise to the latent representations of images and encoding them into vector-quantized tokens in a recurrent manner, RDPM defines a diffusion process over a discrete-valued domain. At each step the model predicts the token codes for the subsequent timestep, transforming initial standard Gaussian noise into the source data distribution with a loss function aligned with that of GPT-style models. RDPM demonstrates superior performance while requiring only a few inference steps, which gives it a speed advantage. The model not only leverages the diffusion process to ensure high-quality generation but also converts continuous signals into a series of high-fidelity discrete tokens, and can therefore share a unified optimization strategy with other discrete tokens, such as text. We anticipate that this work will contribute to the development of a unified model for multimodal generation, specifically by integrating continuous signal domains such as images, videos, and audio with text. We will release the code and model weights to the open-source community.
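
The abstract's pipeline (noise the latents, vector-quantize them into tokens, and train with a GPT-style loss on the tokens of the next timestep) can be illustrated with a toy sketch. This is not the RDPM implementation: the linear noise schedule, the nearest-neighbour quantizer, the independently sampled noise at each step, and the names `quantize`, `noisy_token_trajectory`, and `cross_entropy` are all assumptions made for illustration, and the recurrent encoding described above is simplified to one independent quantization per timestep.

```python
import math
import random


def quantize(vec, codebook):
    """Return the index of the codebook entry nearest to vec (squared L2 distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((v - c) ** 2 for v, c in zip(vec, codebook[k])),
    )


def noisy_token_trajectory(latents, codebook, num_steps, rng):
    """Token ids per timestep, ordered from pure noise (t = num_steps) to clean (t = 0).

    Assumption: a linear schedule alpha = 1 - t / num_steps, which is
    hypothetical and not taken from the paper.
    """
    trajectory = []
    for t in range(num_steps, -1, -1):
        alpha = 1.0 - t / num_steps  # 0 means pure noise, 1 means clean latent
        tokens = []
        for z in latents:
            noised = [
                math.sqrt(alpha) * v + math.sqrt(1.0 - alpha) * rng.gauss(0.0, 1.0)
                for v in z
            ]
            tokens.append(quantize(noised, codebook))
        trajectory.append(tokens)
    return trajectory


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]
```

In this sketch, consecutive entries of the trajectory form the (input, target) pairs: a model that sees the tokens at one step would be trained with `cross_entropy` against the tokens of the next, less noisy step, which is the same form of loss used for next-token prediction in text.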
