Latent diffusion models (LDMs) dominate high-quality image generation, yet integrating representation learning with generative modeling remains a challenge. We introduce a generative image modeling framework that bridges this gap by using a diffusion model to jointly model low-level image latents (from a variational autoencoder) and high-level semantic features (from a pretrained self-supervised encoder such as DINO). Our latent-semantic diffusion approach learns to generate coherent image-feature pairs from pure noise, improving both generative quality and training efficiency while requiring only minimal modifications to standard Diffusion Transformer architectures. By eliminating the need for complex distillation objectives, the unified design simplifies training and unlocks a new inference strategy, Representation Guidance, which uses the learned semantics to steer and refine image generation. Evaluated in both conditional and unconditional settings, our method delivers substantial improvements in image quality and training convergence speed, establishing a new direction for representation-aware generative modeling. Project page and code: https://representationdiffusion.github.io
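To make the joint-modeling idea concrete, the following is a minimal toy sketch of running one diffusion process over the concatenation of VAE image latents and semantic feature tokens, plus a guidance step in the spirit of Representation Guidance. Everything here is an assumption for illustration: the token counts, the DDPM-style noising, the placeholder `toy_denoiser`, and the guidance weight `w` are hypothetical and do not reflect the paper's actual architecture or objective.

```python
import numpy as np

rng = np.random.default_rng(0)

num_latent_tokens = 16   # VAE image-latent tokens (assumed shape)
num_sem_tokens = 4       # semantic feature tokens, e.g. DINO-style (assumed)
dim = 8                  # shared channel dimension (assumed)

z_img = rng.standard_normal((num_latent_tokens, dim))  # low-level image latents
z_sem = rng.standard_normal((num_sem_tokens, dim))     # high-level semantic features

# Joint state: the diffusion model sees a single token sequence
# covering both image latents and semantic features.
x0 = np.concatenate([z_img, z_sem], axis=0)

# Forward noising at one timestep (standard DDPM-style interpolation).
alpha_bar = 0.5
eps = rng.standard_normal(x0.shape)
xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

def toy_denoiser(x):
    """Placeholder for a Diffusion Transformer; returns a noise estimate.

    Not a real model -- it only keeps shapes consistent for the sketch.
    """
    return 0.9 * x

# Guidance sketch: compare a prediction that sees the semantic tokens
# against one where they are zeroed out, then extrapolate toward the
# semantics-informed prediction (analogous in form to classifier-free
# guidance; the actual Representation Guidance rule may differ).
w = 2.0  # guidance weight (assumed)
eps_joint = toy_denoiser(xt)
xt_masked = np.concatenate(
    [xt[:num_latent_tokens], np.zeros_like(xt[num_latent_tokens:])], axis=0)
eps_img_only = toy_denoiser(xt_masked)
eps_guided = eps_img_only + w * (eps_joint - eps_img_only)
```

At inference the guided noise estimate `eps_guided` would replace the unguided one inside a standard sampler loop; the point of the sketch is only that both modalities live in one token sequence, so the same denoiser call produces the semantics-aware and semantics-masked predictions.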