Image tokenization has enabled major advances in autoregressive image generation by providing compressed, discrete representations that are more efficient to process than raw pixels. While traditional approaches use 2D grid tokenization, recent methods like TiTok have shown that 1D tokenization can achieve high generation quality by eliminating grid redundancies. However, these methods typically use a fixed number of tokens and thus cannot adapt to an image's inherent complexity. We introduce FlexTok, a tokenizer that projects 2D images into variable-length, ordered 1D token sequences. For example, a 256×256 image can be resampled into anywhere from 1 to 256 discrete tokens, hierarchically and semantically compressing its information. By training a rectified flow model as the decoder and using nested dropout, FlexTok produces plausible reconstructions regardless of the chosen token sequence length. We evaluate our approach in an autoregressive generation setting using a simple GPT-style Transformer. On ImageNet, this approach achieves an FID<2 across 8 to 128 tokens, outperforming TiTok and matching state-of-the-art methods with far fewer tokens. We further extend the model to support text-conditioned image generation and examine how FlexTok relates to traditional 2D tokenization. A key finding is that FlexTok enables next-token prediction to describe images in a coarse-to-fine "visual vocabulary", and that the number of tokens needed depends on the complexity of the generation task.
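As a rough illustration of the nested-dropout idea described above (a minimal sketch with hypothetical names, not the authors' implementation): during training, the ordered token sequence is randomly truncated to a prefix, so that earlier tokens are forced to carry the coarsest, most semantic information and any prefix length yields a usable representation.

```python
import random

def nested_dropout(tokens, max_len=256):
    """Hypothetical sketch: sample a keep-length k and truncate the ordered
    1D token sequence to its first k tokens. Because the decoder must
    reconstruct the image from any prefix, earlier tokens learn to encode
    coarse content and later tokens refine it."""
    k = random.randint(1, min(max_len, len(tokens)))
    return tokens[:k]

seq = list(range(256))            # stand-in for 256 discrete token ids
truncated = nested_dropout(seq)   # random prefix, length between 1 and 256
```

At inference time, the same mechanism lets one simply choose a sequence length (e.g. 8 tokens for a coarse generation, 128 for a detailed one) and decode from that prefix.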