Latent diffusion models have emerged as the dominant framework for high-fidelity, efficient image generation, owing to their ability to learn diffusion processes in compact latent spaces. Previous research has focused primarily on the reconstruction accuracy and semantic alignment of the latent space; we observe that another factor, robustness to sampling perturbations, also plays a crucial role in determining generation quality. Through empirical and theoretical analyses, we show that the $\beta$-VAE-based tokenizers commonly used in latent diffusion models tend to produce overly compact latent manifolds that are highly sensitive to stochastic perturbations during diffusion sampling, leading to visual degradation. To address this issue, we propose a simple yet effective solution that constructs a latent space robust to sampling perturbations while maintaining strong reconstruction fidelity. Specifically, we introduce a Variance Expansion loss that counteracts variance collapse. The adversarial interplay between the reconstruction objective and variance expansion yields an adaptive balance that preserves reconstruction accuracy while improving robustness to stochastic sampling. Extensive experiments demonstrate that our approach consistently enhances generation quality across different latent diffusion architectures, confirming that robustness in latent space is a key missing ingredient for stable and faithful diffusion sampling.
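To make the idea of counteracting variance collapse concrete, the following is a minimal sketch of how a variance-expanding term might be added to a $\beta$-VAE tokenizer objective. The abstract does not state the form of the Variance Expansion loss, so the hinge penalty below, along with the names `variance_expansion_penalty`, `tokenizer_loss`, `target_std`, `beta`, and `lam`, are assumptions made purely for illustration and should not be read as the method proposed here.

```python
import math


def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, diag(exp(logvar))) || N(0, I)), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))


def variance_expansion_penalty(logvar, target_std=0.5):
    """Hypothetical hinge penalty (an assumption, not the paper's formula).

    It is positive only for dimensions whose posterior standard deviation
    falls below `target_std`, and it is averaged over latent dimensions.
    """
    stds = [math.exp(0.5 * lv) for lv in logvar]
    return sum(max(0.0, target_std - s) for s in stds) / len(stds)


def tokenizer_loss(recon_error, mu, logvar, beta=1e-4, lam=1.0, target_std=0.5):
    """Reconstruction + beta * KL + lam * variance-expansion penalty.

    `beta`, `lam`, and `target_std` are illustrative hyperparameters.
    """
    return (
        recon_error
        + beta * kl_to_standard_normal(mu, logvar)
        + lam * variance_expansion_penalty(logvar, target_std)
    )


# A nearly collapsed posterior (tiny variance) is penalized,
# while a unit-variance posterior incurs no expansion penalty.
collapsed = variance_expansion_penalty([-10.0, -10.0])
wide = variance_expansion_penalty([0.0, 0.0])
```

In this sketch the reconstruction term favors small posterior variance (less noise passed to the decoder), while the penalty pushes variance up, which is one simple way to picture the opposing pressures the abstract describes.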