Diffusion Probabilistic Models (DPMs) have emerged as the de facto approach for high-fidelity image synthesis. They run the diffusion process on continuous VAE latents, which differs significantly from the text generation methods employed by Large Language Models (LLMs). In this paper, we introduce a novel generative framework, the Recurrent Diffusion Probabilistic Model (RDPM), which enhances the diffusion process through a recurrent token prediction mechanism, thereby pioneering the field of Discrete Diffusion. By progressively adding Gaussian noise to the latent representations of images and recurrently encoding them into vector-quantized tokens, RDPM defines a diffusion process over a discrete-valued domain. This process iteratively predicts the token codes of subsequent timesteps, transforming initial standard Gaussian noise into the source data distribution, and its loss function aligns with that of GPT-style models. RDPM demonstrates superior performance while requiring only a few inference steps, which gives it a speed advantage. The model not only leverages the diffusion process to ensure high-quality generation but also converts continuous signals into a series of high-fidelity discrete tokens, and it therefore shares a unified optimization strategy with other discrete tokens, such as text. We anticipate that this work will contribute to the development of a unified model for multimodal generation, specifically by integrating continuous signal domains such as images, videos, and audio with text. We will release the code and model weights to the open-source community.
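To make the mechanism described above concrete, the following is a minimal NumPy sketch of one possible reading of it, not the released RDPM implementation. It noises a set of latent vectors at several timesteps, maps each noised copy to its nearest codebook entry, and scores a prediction with a cross-entropy loss, which is our interpretation of "GPT-style" loss. The function names (`quantize`, `noisy_token_sequence`, `cross_entropy`), the variance-preserving noising rule, the linear schedule, and all sizes are illustrative assumptions that do not come from the abstract.

```python
import numpy as np

def quantize(latents, codebook):
    """Return the index of the nearest codebook vector for each latent vector."""
    # Squared Euclidean distance between every latent (N, D) and every code (K, D).
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def noisy_token_sequence(latents, codebook, alpha_bars, rng):
    """Tokenize progressively noised copies of `latents`, one row per timestep."""
    tokens = []
    for alpha_bar in alpha_bars:
        eps = rng.standard_normal(latents.shape)
        # Assumed variance-preserving noising rule (standard in DPMs,
        # not confirmed by the abstract).
        noisy = np.sqrt(alpha_bar) * latents + np.sqrt(1.0 - alpha_bar) * eps
        tokens.append(quantize(noisy, codebook))
    return np.stack(tokens)  # shape (T, N)

def cross_entropy(logits, targets):
    """Mean negative log-likelihood of integer targets under softmax(logits)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(targets.shape[0]), targets].mean()

# Hypothetical sizes: 16 codes of dimension 4, 8 latent vectors, 5 timesteps.
rng = np.random.default_rng(0)
codebook = rng.standard_normal((16, 4))
latents = rng.standard_normal((8, 4))
alpha_bars = np.linspace(1.0, 0.0, 5)  # clean latents first, pure noise last

tokens = noisy_token_sequence(latents, codebook, alpha_bars, rng)

# Placeholder logits stand in for a model that predicts the tokens of the
# next, less noisy timestep; a real model would produce these.
placeholder_logits = np.zeros((latents.shape[0], codebook.shape[0]))
loss = cross_entropy(placeholder_logits, tokens[0])
```

Under these assumptions, each row of `tokens` is the discrete target for one timestep, so a recurrent predictor could be trained to map the tokens of a noisier row to those of the row before it, using the same token-classification objective as a language model.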