Despite their fundamental role, it remains unclear what properties make tokenizers effective for generative modeling. We observe that modern generative models share a conceptually similar training objective: reconstructing clean signals from corrupted inputs (e.g., signals degraded by Gaussian noise or masking), a process we term denoising. Motivated by this insight, we propose aligning tokenizer embeddings directly with the downstream denoising objective, encouraging latent embeddings to remain reconstructable even under heavy corruption. To this end, we introduce the Latent Denoising Tokenizer (l-DeTok), a simple yet highly effective tokenizer trained to reconstruct clean images from latent embeddings corrupted by interpolative noise or random masking. Extensive experiments on class-conditioned (ImageNet 256×256 and 512×512) and text-conditioned (MSCOCO) image generation benchmarks show that l-DeTok consistently improves generation quality over prior tokenizers across six representative generative models. Our findings highlight denoising as a fundamental design principle for tokenizers, and we hope they motivate new perspectives on future tokenizer design.
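To make the corruption process concrete, the sketch below illustrates the two corruption operators the abstract names, interpolative noise and random masking, applied to encoder latents before decoding. This is a minimal PyTorch sketch, not the paper's implementation: the uniform interpolation factor, the mask ratio, and the zero-vector mask token are illustrative assumptions.

```python
import torch

def corrupt_latents(z, mask_ratio=0.7, noise_std=1.0):
    """Corrupt tokenizer latents with interpolative noise and random masking.

    z: (B, N, D) latent token embeddings from the tokenizer encoder.
    Returns corrupted latents of the same shape.
    """
    B, N, _ = z.shape

    # Interpolative noise: blend each sample's latents with Gaussian noise
    # using an interpolation factor tau ~ U(0, 1) (assumed schedule).
    tau = torch.rand(B, 1, 1, device=z.device)
    eps = noise_std * torch.randn_like(z)
    z_noised = (1.0 - tau) * z + tau * eps

    # Random masking: replace a random subset of token embeddings with a
    # mask token (a zero vector here for simplicity; a learnable embedding
    # would be the more typical choice).
    keep = torch.rand(B, N, device=z.device) > mask_ratio
    z_corrupted = torch.where(keep.unsqueeze(-1), z_noised,
                              torch.zeros_like(z_noised))
    return z_corrupted
```

Under this setup, the tokenizer's decoder would be trained to reconstruct the clean image from `corrupt_latents(z)`, e.g., with a pixel-space reconstruction loss, so that the learned latents stay decodable even when heavily corrupted.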