Recent progress in diffusion-based visual generation has largely relied on latent diffusion models with variational autoencoders (VAEs). While effective for high-fidelity synthesis, this VAE+diffusion paradigm suffers from limited training efficiency, slow inference, and poor transferability to broader vision tasks. These issues stem from a key limitation of VAE latent spaces: the lack of clear semantic separation and strong discriminative structure. Our analysis confirms that these properties are crucial not only for perception and understanding tasks, but also for the stable and efficient training of latent diffusion models. Motivated by this insight, we introduce SVG, a novel latent diffusion model without variational autoencoders, which leverages self-supervised representations for visual generation. SVG constructs a feature space with clear semantic discriminability by leveraging frozen DINO features, while a lightweight residual branch captures fine-grained details for high-fidelity reconstruction. Diffusion models are trained directly on this semantically structured latent space to facilitate more efficient learning. As a result, SVG enables accelerated diffusion training, supports few-step sampling, and improves generative quality. Experimental results further show that SVG preserves the semantic and discriminative capabilities of the underlying self-supervised representations, providing a principled pathway toward task-general, high-quality visual representations. Code and interpretations are available at https://howlin-wang.github.io/svg/.
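As a rough illustration of the latent construction described above, the toy sketch below pairs a frozen "semantic" encoder with a small residual branch and joins their outputs per patch. It is not the authors' implementation. The abstract does not say how the two feature sets are combined, so channel-wise concatenation is an assumption here. The random linear projections merely stand in for DINO and the residual branch, and all names and dimensions (`linear_map`, `PATCH_DIM`, `SEMANTIC_DIM`, `RESIDUAL_DIM`) are hypothetical.

```python
import random


def linear_map(in_dim, out_dim, seed):
    """Return a fixed random projection standing in for an encoder (hypothetical)."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]

    def apply(patch):
        # One output channel per weight row: a plain dot product.
        return [sum(w * x for w, x in zip(row, patch)) for row in weights]

    return apply


PATCH_DIM = 12     # flattened pixels per patch (toy value)
SEMANTIC_DIM = 8   # channels from the frozen semantic encoder (toy value)
RESIDUAL_DIM = 2   # channels from the lightweight residual branch (toy value)

# Stand-in for a frozen self-supervised encoder such as DINO: its weights never change.
frozen_semantic_encoder = linear_map(PATCH_DIM, SEMANTIC_DIM, seed=0)
# Stand-in for the residual branch that would carry fine-grained detail.
residual_encoder = linear_map(PATCH_DIM, RESIDUAL_DIM, seed=1)


def encode(patches):
    """Map each patch to a latent token: semantic channels, then residual channels.

    Concatenation is an assumption made for this sketch.
    """
    return [frozen_semantic_encoder(p) + residual_encoder(p) for p in patches]


rng = random.Random(42)
image_patches = [[rng.random() for _ in range(PATCH_DIM)] for _ in range(4)]
latents = encode(image_patches)
print(len(latents), len(latents[0]))  # 4 tokens, 10 channels each
```

In this sketch a diffusion model would be trained on `latents` rather than on VAE codes. The semantic channels come from the frozen encoder unchanged, so whatever discriminative structure that encoder provides is carried into the latent space.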