Diffusion models (DMs) have revolutionized generative learning. They utilize a diffusion process to encode data into a simple Gaussian distribution. However, encoding a complex, potentially multimodal data distribution into a single continuous Gaussian distribution arguably represents an unnecessarily challenging learning problem. We propose Discrete-Continuous Latent Variable Diffusion Models (DisCo-Diff) to simplify this task by introducing complementary discrete latent variables. We augment DMs with learnable discrete latents, inferred with an encoder, and train the DM and encoder end-to-end. DisCo-Diff does not rely on pre-trained networks, making the framework universally applicable. The discrete latents significantly simplify learning the DM's complex noise-to-data mapping by reducing the curvature of the DM's generative ODE. An additional autoregressive transformer models the distribution of the discrete latents, a simple step because DisCo-Diff requires only a few discrete variables with small codebooks. We validate DisCo-Diff on toy data, several image synthesis tasks, and molecular docking, and find that introducing discrete latents consistently improves model performance. For example, DisCo-Diff achieves state-of-the-art FID scores on class-conditioned ImageNet-64/128 with an ODE sampler.
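To make the end-to-end training recipe concrete, below is a minimal sketch of the core idea: an encoder infers discrete latents from the clean data, the denoiser is conditioned on them, and both are trained jointly with a standard denoising loss. This is an illustrative toy implementation, not the paper's architecture: the MLP modules, the sizes (K=4 latents, codebook size 16), and the Gumbel-softmax relaxation used to backpropagate through the discrete choice are all assumptions.

```python
# Toy sketch of DisCo-Diff-style joint training (all names/sizes are
# illustrative assumptions, not the paper's exact architecture).
import torch
import torch.nn as nn
import torch.nn.functional as F

K, C, D = 4, 16, 32  # num discrete latents, codebook size, data dim (toy)

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(D, 128), nn.SiLU(), nn.Linear(128, K * C))
    def forward(self, x):
        logits = self.net(x).view(-1, K, C)
        # Gumbel-softmax: differentiable relaxation of the discrete latents
        return F.gumbel_softmax(logits, tau=1.0, hard=True)  # (B, K, C) one-hot

class Denoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(K * C, 64)  # condition on the discrete latents
        self.net = nn.Sequential(nn.Linear(D + 64 + 1, 256), nn.SiLU(), nn.Linear(256, D))
    def forward(self, x_t, sigma, z):
        cond = self.embed(z.flatten(1))
        h = torch.cat([x_t, cond, sigma.log()[:, None]], dim=1)
        return self.net(h)  # predicted clean data x_0

encoder, denoiser = Encoder(), Denoiser()
opt = torch.optim.Adam(list(encoder.parameters()) + list(denoiser.parameters()), lr=1e-3)

x0 = torch.randn(8, D)                       # stand-in for a data batch
sigma = torch.rand(8) * 2 + 0.01             # noise levels
xt = x0 + sigma[:, None] * torch.randn_like(x0)
z = encoder(x0)                              # discrete latents inferred from clean data
loss = F.mse_loss(denoiser(xt, sigma, z), x0)
opt.zero_grad(); loss.backward(); opt.step()
```

At sampling time, one would first draw the discrete latents from their learned prior and then integrate the DM's generative ODE conditioned on them; the claim in the abstract is that this conditioning reduces the ODE's curvature and hence the difficulty of the noise-to-data mapping.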
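The second component, the autoregressive transformer over the discrete latents, is lightweight precisely because the sequence is short (only a few latents) and the codebooks are small. A hedged sketch under the same toy assumptions as above (module names and sizes are hypothetical):

```python
# Toy sketch of an autoregressive prior over the K discrete latents
# (codebook size C); a small causal transformer predicts each code from
# the previous ones. Sizes and names are illustrative assumptions.
import torch
import torch.nn as nn

K, C = 4, 16

class LatentPrior(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.tok = nn.Embedding(C + 1, d)   # +1 for a start token
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, C)
    def forward(self, z):                    # z: (B, K) integer codes
        B = z.shape[0]
        start = torch.full((B, 1), C, dtype=torch.long)
        seq = self.tok(torch.cat([start, z[:, :-1]], dim=1))   # shift right
        mask = nn.Transformer.generate_square_subsequent_mask(K)
        return self.head(self.body(seq, mask=mask))             # (B, K, C) logits

prior = LatentPrior()
z = torch.randint(0, C, (8, K))              # codes inferred by the encoder
logits = prior(z)
loss = nn.functional.cross_entropy(logits.reshape(-1, C), z.reshape(-1))
```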