Image tokenizers are crucial for visual generative models, e.g., diffusion models (DMs) and autoregressive (AR) models, as they construct the latent representation used for modeling. Increasing token length is a common approach to improving image reconstruction quality. However, tokenizers with longer token lengths are not guaranteed to achieve better generation quality: there exists a trade-off between reconstruction and generation quality with respect to token length. In this paper, we investigate the impact of token length on both image reconstruction and generation and provide a flexible solution to this trade-off. We propose ImageFolder, a semantic tokenizer that provides spatially aligned image tokens that can be folded during autoregressive modeling to improve both generation efficiency and quality. To enhance representational capability without increasing token length, we leverage dual-branch product quantization to capture different contexts of images. Specifically, semantic regularization is introduced in one branch to encourage compact semantic information, while the other branch is designed to capture the remaining pixel-level details. Extensive experiments demonstrate that the ImageFolder tokenizer achieves superior image generation quality with a shorter token length.
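To make the dual-branch product-quantization idea concrete, here is a minimal conceptual sketch (not the paper's implementation): each latent vector is split into two halves, each half is quantized against its own codebook via nearest-neighbor lookup, and the two resulting code indices form a pair of spatially aligned tokens per position. All names (`nearest_code`, `dual_branch_pq`, codebook sizes and dimensions) are illustrative assumptions.

```python
import numpy as np

def nearest_code(z, codebook):
    # z: (N, d) latent vectors; codebook: (K, d) code vectors.
    # Returns the index of the nearest code and the quantized vectors.
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K) squared distances
    idx = d2.argmin(axis=1)
    return idx, codebook[idx]

def dual_branch_pq(z, cb_sem, cb_det):
    # Hypothetical sketch of dual-branch product quantization:
    # split each latent into two halves; one branch would carry
    # semantically regularized codes, the other pixel-level detail.
    d = z.shape[1] // 2
    idx_s, q_s = nearest_code(z[:, :d], cb_sem)   # semantic branch
    idx_d, q_d = nearest_code(z[:, d:], cb_det)   # detail branch
    z_q = np.concatenate([q_s, q_d], axis=1)      # quantized latent for reconstruction
    tokens = np.stack([idx_s, idx_d], axis=1)     # two aligned tokens per spatial position
    return tokens, z_q

rng = np.random.default_rng(0)
z = rng.normal(size=(16, 8))        # 16 spatial positions, 8-dim latents (toy sizes)
cb_sem = rng.normal(size=(32, 4))   # toy semantic codebook
cb_det = rng.normal(size=(32, 4))   # toy detail codebook
tokens, z_q = dual_branch_pq(z, cb_sem, cb_det)
```

Because the two branches share each spatial position, the token pair can be "folded" (predicted jointly per position) during autoregressive modeling, which is what keeps the effective sequence length short.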