Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models

The performance of Latent Diffusion Models (LDMs) depends critically on the quality of their visual tokenizer. While recent works have explored incorporating Vision Foundation Models (VFMs) via distillation, we identify a fundamental flaw in this approach: it inevitably weakens the robustness of alignment with the original VFM, causing the aligned latents to deviate semantically under distribution shifts. In this paper, we bypass distillation by proposing a more direct approach: the Vision Foundation Model Variational Autoencoder (VFM-VAE). To resolve the inherent tension between the VFM's semantic focus and the need for pixel-level fidelity, we redesign the VFM-VAE decoder with Multi-Scale Latent Fusion and Progressive Resolution Reconstruction blocks, enabling high-quality reconstruction from spatially coarse VFM features. Furthermore, we provide a comprehensive analysis of representation dynamics during diffusion training and introduce the SE-CKNNA metric as a more precise tool for this diagnosis. This analysis allows us to develop a joint tokenizer-diffusion alignment strategy that dramatically accelerates convergence. Together, these innovations in tokenizer design and training strategy lead to superior performance and efficiency: our system reaches a gFID (w/o CFG) of 2.20 in merely 80 epochs (a 10x speedup over prior tokenizers). With continued training to 640 epochs, it further attains a gFID (w/o CFG) of 1.62, establishing direct VFM integration as a superior paradigm for LDMs.
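The abstract does not define SE-CKNNA, so the sketch below is not that metric. It only illustrates the general idea of a nearest-neighbor alignment score between two representations of the same samples (for example, diffusion features and VFM features): for each sample, compare its k nearest neighbors under each representation and average the overlap. The function names, the use of cosine similarity, and the toy data are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def knn_indices(feats, i, k):
    """Indices of the k samples most similar to sample i (excluding i)."""
    sims = [(cosine(feats[i], feats[j]), j) for j in range(len(feats)) if j != i]
    # Sort by descending similarity; break ties by index for determinism.
    sims.sort(key=lambda t: (-t[0], t[1]))
    return {j for _, j in sims[:k]}


def mutual_knn_alignment(feats_a, feats_b, k):
    """Average fraction of shared k-nearest neighbors across two feature sets.

    feats_a[i] and feats_b[i] must describe the same sample. Returns a value
    in [0, 1]; 1 means both representations induce identical neighborhoods.
    This is a hypothetical, simplified stand-in, not the paper's SE-CKNNA.
    """
    if len(feats_a) != len(feats_b):
        raise ValueError("both feature sets must cover the same samples")
    n = len(feats_a)
    total = 0.0
    for i in range(n):
        shared = knn_indices(feats_a, i, k) & knn_indices(feats_b, i, k)
        total += len(shared) / k
    return total / n
```

Tracking a score of this kind over training steps is one way to see whether a model's internal features drift toward or away from a reference representation; the paper's own metric may differ in every detail.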

