End-to-End Training for Unified Tokenization and Latent Denoising

From arXiv. The first two authors contributed equally. Project: https://xingjianbai.com/unite-tokenization-generation/ Code: https://github.com/ShivamDuggal4/UNITE-tokenization-generation

Latent diffusion models (LDMs) enable high-fidelity synthesis by operating in learned latent spaces. However, training state-of-the-art LDMs requires complex staging: a tokenizer must be trained first, and only then can the diffusion model be trained in the frozen latent space. We propose UNITE, an autoencoder architecture for unified tokenization and latent diffusion. UNITE consists of a Generative Encoder that serves as both image tokenizer and latent generator via weight sharing. Our key insight is that tokenization and generation can be viewed as the same latent inference problem under different conditioning regimes: tokenization infers latents from fully observed images, whereas generation infers them from noise together with text or class conditioning. Motivated by this, we introduce a single-stage training procedure that jointly optimizes both tasks via two forward passes through the same Generative Encoder. The shared parameters let gradients from both tasks jointly shape the latent space, encouraging a "common latent language". Across image and molecule modalities, UNITE achieves near-state-of-the-art performance without adversarial losses or pretrained encoders (e.g., DINO), reaching FID 2.12 and 1.73 for Base and Large models, respectively, on ImageNet 256×256. We further analyze the Generative Encoder through the lenses of representation alignment and compression. These results show that single-stage joint training of tokenization and generation from scratch is feasible.
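
The "two forward passes through the same Generative Encoder" can be illustrated with a small forward-only NumPy sketch. It is not the authors' implementation: the layer shapes, the linear decoder, the "no condition" class row, the linear noise interpolation, and the equal loss weighting are all assumptions made for illustration, and every name (`generative_encoder`, `joint_loss`, `params`) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen only for illustration (assumption).
IMAGE_DIM, LATENT_DIM, HIDDEN_DIM, NUM_CLASSES = 48, 8, 32, 10


def init(shape):
    """Small random weights; stands in for a real initializer."""
    return rng.normal(scale=0.1, size=shape)


# Hypothetical parameters. `core` and `to_latent` are used by both passes.
params = {
    "embed_image": init((IMAGE_DIM, HIDDEN_DIM)),        # input for tokenization
    "embed_latent": init((LATENT_DIM, HIDDEN_DIM)),      # input for generation
    "embed_class": init((NUM_CLASSES + 1, HIDDEN_DIM)),  # last row = "no condition"
    "core": init((HIDDEN_DIM, HIDDEN_DIM)),              # shared trunk weights
    "to_latent": init((HIDDEN_DIM, LATENT_DIM)),         # shared output head
    "decoder": init((LATENT_DIM, IMAGE_DIM)),            # latent -> image
}


def generative_encoder(p, hidden, class_ids):
    """Shared trunk: infers latents from an embedded input plus conditioning."""
    hidden = hidden + p["embed_class"][class_ids]
    hidden = np.tanh(hidden @ p["core"])
    return hidden @ p["to_latent"]


def joint_loss(p, images, class_ids, rng):
    """Sum of a reconstruction loss and a latent denoising loss (assumed form)."""
    batch = images.shape[0]
    null_ids = np.full(batch, NUM_CLASSES)

    # Pass 1 (tokenization): infer latents from fully observed images.
    latents = generative_encoder(p, images @ p["embed_image"], null_ids)
    recon_loss = np.mean((latents @ p["decoder"] - images) ** 2)

    # Pass 2 (generation): infer latents from noised latents plus class labels.
    t = rng.uniform(size=(batch, 1))          # per-sample noise level in [0, 1)
    noise = rng.normal(size=latents.shape)
    noisy = (1.0 - t) * latents + t * noise   # assumed linear interpolation
    denoised = generative_encoder(p, noisy @ p["embed_latent"], class_ids)
    denoise_loss = np.mean((denoised - latents) ** 2)

    return recon_loss + denoise_loss, recon_loss, denoise_loss


images = rng.normal(size=(4, IMAGE_DIM))
class_ids = rng.integers(0, NUM_CLASSES, size=4)
total, recon, denoise = joint_loss(params, images, class_ids, rng)
```

In an actual training loop the summed loss would be differentiated with an autodiff framework, so that both terms send gradients into the shared `core` and `to_latent` weights; the sketch leaves that step out and only shows how the two passes reuse the same parameters.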
