The performance of Latent Diffusion Models (LDMs) is critically dependent on the quality of their visual tokenizer. While recent works have explored incorporating Vision Foundation Models (VFMs) via distillation, we identify a fundamental flaw in this approach: it inevitably weakens the robustness of alignment with the original VFM, causing the aligned latents to deviate semantically under distribution shifts. In this paper, we bypass distillation by proposing a more direct approach: the Vision Foundation Model Variational Autoencoder (VFM-VAE). To resolve the inherent tension between the VFM's semantic focus and the need for pixel-level fidelity, we redesign the VFM-VAE decoder with Multi-Scale Latent Fusion and Progressive Resolution Reconstruction blocks, enabling high-quality reconstruction from spatially coarse VFM features. Furthermore, we provide a comprehensive analysis of representation dynamics during diffusion training, introducing the SE-CKNNA metric as a more precise diagnostic tool. This analysis allows us to develop a joint tokenizer-diffusion alignment strategy that dramatically accelerates convergence. Our innovations in tokenizer design and training strategy lead to superior performance and efficiency: our system reaches a gFID (w/o CFG) of 2.20 in merely 80 epochs (a 10x speedup over prior tokenizers). With continued training to 640 epochs, it further attains a gFID (w/o CFG) of 1.62, establishing direct VFM integration as a superior paradigm for LDMs.
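To make the decoder idea concrete, the following is a minimal, hypothetical PyTorch sketch of reconstructing high-resolution pixels from spatially coarse VFM latents by (i) fusing a resized copy of the latent at every scale (multi-scale latent fusion) and (ii) progressively doubling the spatial resolution (progressive resolution reconstruction). All module names, channel widths, and the overall wiring here are illustrative assumptions, not the authors' exact architecture.

```python
# Hypothetical sketch: decode coarse (e.g. 16x16) VFM feature maps into 256x256 RGB images.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionUpBlock(nn.Module):
    """Upsample 2x, then fuse a resized copy of the coarse latent (multi-scale fusion)."""
    def __init__(self, in_ch, out_ch, latent_ch):
        super().__init__()
        self.proj = nn.Conv2d(latent_ch, out_ch, kernel_size=1)  # project latent to block width
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + out_ch, out_ch, kernel_size=3, padding=1),
            nn.GroupNorm(8, out_ch),
            nn.SiLU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.GroupNorm(8, out_ch),
            nn.SiLU(),
        )

    def forward(self, x, latent):
        x = F.interpolate(x, scale_factor=2, mode="nearest")       # progressive 2x upsampling
        z = F.interpolate(latent, size=x.shape[-2:], mode="bilinear",
                          align_corners=False)                     # latent resized to current scale
        return self.conv(torch.cat([x, self.proj(z)], dim=1))

class CoarseLatentDecoder(nn.Module):
    """Four doublings: 16x16 latent grid -> 256x256 image (widths are assumptions)."""
    def __init__(self, latent_ch=768, widths=(512, 256, 128, 64)):
        super().__init__()
        self.stem = nn.Conv2d(latent_ch, widths[0], kernel_size=3, padding=1)
        chs = (widths[0],) + tuple(widths)
        self.blocks = nn.ModuleList(
            FusionUpBlock(chs[i], chs[i + 1], latent_ch) for i in range(len(widths))
        )
        self.to_rgb = nn.Conv2d(widths[-1], 3, kernel_size=3, padding=1)

    def forward(self, latent):
        x = self.stem(latent)
        for block in self.blocks:
            x = block(x, latent)  # re-inject the coarse latent at every resolution
        return self.to_rgb(x)

if __name__ == "__main__":
    latent = torch.randn(2, 768, 16, 16)        # stand-in for frozen-VFM features
    print(CoarseLatentDecoder()(latent).shape)  # torch.Size([2, 3, 256, 256])
```

The design choice this sketch illustrates is that the semantic latent stays spatially coarse (as produced by the VFM) while pixel detail is recovered entirely in the decoder, with the latent re-injected at each scale so fine-grained reconstruction never loses sight of the semantic content.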