By embedding discrete representations into a continuous latent space, we can leverage continuous-space latent diffusion models for generative modeling of discrete data. However, despite their initial success, most latent diffusion methods rely on fixed pretrained embeddings, limiting the benefits of joint training with the diffusion model. While jointly learning the embedding (via a reconstruction loss) and the latent diffusion model (via a score matching loss) could enhance performance, end-to-end training risks embedding collapse, degrading generation quality. To mitigate this issue, we introduce VQ-LCMD, a continuous-space latent diffusion framework that operates in the embedding space and stabilizes joint training. VQ-LCMD uses a novel training objective combining the joint embedding-diffusion variational lower bound with a consistency-matching (CM) loss, alongside a shifted cosine noise schedule and a random dropping strategy. Experiments on several benchmarks show that the proposed VQ-LCMD yields superior results on FFHQ, LSUN Churches, and LSUN Bedrooms compared to discrete-state latent diffusion models. In particular, VQ-LCMD achieves an FID of 6.81 for class-conditional image generation on ImageNet with 50 steps.
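For readers unfamiliar with the noise-schedule component mentioned above, the following is a minimal sketch of how a shifted cosine noise schedule is commonly parameterized (a shift of the cosine schedule in log-SNR space, as popularized for high-resolution diffusion). The abstract does not specify VQ-LCMD's exact formulation, so the function names and the `shift` value here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def shifted_cosine_logsnr(t, shift=0.5):
    """Log-SNR of a cosine noise schedule shifted in log-SNR space.

    t: diffusion time in (0, 1). shift=1 recovers the standard cosine
    schedule; shift<1 injects noise earlier, shift>1 preserves signal longer.
    (Illustrative parameterization; the paper's exact schedule may differ.)
    """
    return -2.0 * np.log(np.tan(np.pi * t / 2.0)) + 2.0 * np.log(shift)

def alpha_sigma(t, shift=0.5):
    """Signal/noise coefficients of the variance-preserving forward process
    z_t = alpha_t * z_0 + sigma_t * eps, recovered from the log-SNR."""
    logsnr = shifted_cosine_logsnr(t, shift)
    alpha = np.sqrt(1.0 / (1.0 + np.exp(-logsnr)))  # sigmoid(logsnr)
    sigma = np.sqrt(1.0 / (1.0 + np.exp(logsnr)))   # sigmoid(-logsnr)
    return alpha, sigma

# Example: inspect the coefficients at a few diffusion times.
for t in (0.1, 0.5, 0.9):
    a, s = alpha_sigma(t)
    print(f"t={t:.1f}  alpha={a:.3f}  sigma={s:.3f}")
```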