Boosting Latent Diffusion Models via Disentangled Representation Alignment

Latent Diffusion Models (LDMs) generate high-quality images by operating in a compressed latent space, typically obtained through image tokenizers such as Variational Autoencoders (VAEs). In pursuit of a generation-friendly VAE, recent studies have explored using Vision Foundation Models (VFMs) as representation alignment targets for VAEs, mirroring the approach commonly adopted for LDMs. Although this yields some performance gains, using the same alignment target for both VAEs and LDMs overlooks their fundamentally different representational requirements. We argue that while LDMs benefit from latents that retain high-level semantic concepts, VAEs should excel at semantic disentanglement, encoding attribute-level information in a structured way. To address this, we propose the Semantic disentangled VAE (Send-VAE), which is explicitly optimized for disentangled representation learning by aligning its latent space with the semantic hierarchy of pre-trained VFMs. Our approach employs a non-linear mapper network to transform VAE latents before aligning them with VFM features. The mapper bridges the gap between attribute-level disentanglement and high-level semantics, so that the VFM can provide effective guidance for VAE learning. We evaluate semantic disentanglement via linear probing on attribute prediction tasks and show that it correlates strongly with improved generation performance. Finally, we use Send-VAE to train flow-based transformers (SiTs); experiments show that Send-VAE significantly speeds up training and achieves state-of-the-art FIDs of 1.21 and 1.75 on ImageNet 256x256, with and without classifier-free guidance, respectively.
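To make the alignment step concrete, the following is a minimal sketch of how a non-linear mapper could project VAE latent tokens into a VFM feature space and score the match. The abstract does not specify the mapper architecture or the loss, so everything here is an assumption for illustration: the two-layer MLP with GELU, the negative cosine similarity objective, the toy dimensions, and all function names (`mapper`, `alignment_loss`, and so on) are hypothetical, and random vectors stand in for real VAE latents and VFM features.

```python
import math
import random


def linear(x, weight, bias):
    """Apply y = W x + b to a single token vector x (plain lists of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def gelu(v):
    """Element-wise GELU using the standard tanh approximation."""
    c = math.sqrt(2.0 / math.pi)
    return [0.5 * t * (1.0 + math.tanh(c * (t + 0.044715 * t ** 3))) for t in v]


def init_linear(rng, d_in, d_out):
    """Random weights in [-1/sqrt(d_in), 1/sqrt(d_in)] and zero bias."""
    scale = 1.0 / math.sqrt(d_in)
    weight = [[rng.uniform(-scale, scale) for _ in range(d_in)] for _ in range(d_out)]
    return weight, [0.0] * d_out


def mapper(token, params):
    """Hypothetical non-linear mapper (assumed): Linear -> GELU -> Linear."""
    (w1, b1), (w2, b2) = params
    return linear(gelu(linear(token, w1, b1)), w2, b2)


def cosine(u, v, eps=1e-8):
    """Cosine similarity between two vectors, with eps to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def alignment_loss(latent_tokens, vfm_tokens, params):
    """Assumed objective: mean negative cosine similarity over paired tokens."""
    sims = [cosine(mapper(z, params), f) for z, f in zip(latent_tokens, vfm_tokens)]
    return -sum(sims) / len(sims)


# Toy sizes, chosen only for illustration (not taken from the paper).
LATENT_DIM, HIDDEN_DIM, VFM_DIM, NUM_TOKENS = 4, 16, 8, 6

rng = random.Random(0)
params = (
    init_linear(rng, LATENT_DIM, HIDDEN_DIM),
    init_linear(rng, HIDDEN_DIM, VFM_DIM),
)

# Random stand-ins for per-patch VAE latents and frozen VFM features.
latent_tokens = [[rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)] for _ in range(NUM_TOKENS)]
vfm_tokens = [[rng.gauss(0.0, 1.0) for _ in range(VFM_DIM)] for _ in range(NUM_TOKENS)]

loss = alignment_loss(latent_tokens, vfm_tokens, params)
print(f"alignment loss: {loss:.4f}")
```

In an actual training setup, a term of this kind would presumably be added to the usual VAE objective and minimized with respect to both the encoder and the mapper while the VFM stays frozen. The sketch only evaluates the loss and does not perform any gradient updates.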
