An Image is Worth 32 Tokens for Reconstruction and Generation

From arXiv: a compact 1D image tokenization method that achieves SOTA generation performance while being substantially faster. Project page: https://yucornetto.github.io/projects/titok.html

Recent advancements in generative models have highlighted the crucial role of image tokenization in the efficient synthesis of high-resolution images. Tokenization, which transforms images into latent representations, reduces computational demands compared to directly processing pixels and enhances the effectiveness and efficiency of the generation process. Prior methods, such as VQGAN, typically utilize 2D latent grids with fixed downsampling factors. However, these 2D tokenizations face challenges in managing the inherent redundancies present in images, where adjacent regions frequently display similarities. To overcome this issue, we introduce the Transformer-based 1-Dimensional Tokenizer (TiTok), an approach that tokenizes images into 1D latent sequences. TiTok provides a more compact latent representation, yielding substantially more efficient and effective representations than conventional techniques. For example, a 256 x 256 x 3 image can be reduced to just 32 discrete tokens, a significant reduction from the 256 or 1024 tokens obtained by prior methods. Despite its compact nature, TiTok achieves performance competitive with state-of-the-art approaches. Specifically, using the same generator framework, TiTok attains 1.97 gFID, outperforming the MaskGIT baseline by 4.21 on the ImageNet 256 x 256 benchmark. The advantages of TiTok become even more significant at higher resolution. On the ImageNet 512 x 512 benchmark, TiTok not only outperforms the state-of-the-art diffusion model DiT-XL/2 (gFID 2.74 vs. 3.04), but also reduces the number of image tokens by 64x, leading to a 410x faster generation process. Our best-performing variant significantly surpasses DiT-XL/2 (gFID 2.13 vs. 3.04) while still generating high-quality samples 74x faster.
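
The token counts quoted above follow from simple grid arithmetic. The sketch below is only an illustration of that arithmetic, not code from the paper: the helper name `grid_token_count` is hypothetical, and the downsampling factors 16 and 8 are assumptions chosen because they reproduce the 256 and 1024 tokens mentioned in the abstract.

```python
def grid_token_count(height: int, width: int, downsample: int) -> int:
    """Number of tokens in a 2D latent grid with a fixed downsampling factor.

    Hypothetical helper for illustration; not part of any TiTok or VQGAN API.
    """
    if height % downsample or width % downsample:
        raise ValueError("image size must be divisible by the downsampling factor")
    return (height // downsample) * (width // downsample)


# 1D sequence length quoted in the abstract for a 256 x 256 image.
TITOK_TOKENS = 32

# Assumed factors: 16 and 8 give the 256 and 1024 tokens cited for prior methods.
for factor in (16, 8):
    n = grid_token_count(256, 256, factor)
    print(f"f={factor}: {n} grid tokens, {n // TITOK_TOKENS}x more than {TITOK_TOKENS}")
```

Under these assumptions, a 2D grid uses 8x to 32x more tokens than the 32-token 1D sequence at 256 x 256. The same function gives 4096 tokens for a 512 x 512 image at factor 8; whether that is the baseline behind the 64x figure is not stated here.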
