Image tokenizers map images to sequences of discrete tokens and are a crucial component of autoregressive transformer-based image generation. The tokens are typically associated with spatial locations in the input image, arranged in raster-scan order, which is not ideal for autoregressive modeling. In this paper, we propose to tokenize the image spectrum instead, obtained from a discrete wavelet transform (DWT), such that the sequence of tokens represents the image in a coarse-to-fine fashion. Our tokenizer brings several advantages: 1) it leverages the fact that natural images are more compressible at high frequencies, 2) it can encode and reconstruct images of different resolutions without retraining, 3) it improves the conditioning for next-token prediction: instead of conditioning on a partial line-by-line reconstruction of the image, the model conditions on a coarse reconstruction of the full image, 4) it enables partial decoding, where the first few generated tokens already reconstruct a coarse version of the image, and 5) it enables autoregressive models to be used for image upsampling. We evaluate the tokenizer on reconstruction metrics as well as on multiscale image generation, text-guided image upsampling, and editing.
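To make the coarse-to-fine ordering concrete, here is a minimal sketch (not the paper's implementation, and the function names are hypothetical) of a multi-level 2D Haar DWT that decomposes an image into subbands and emits them coarsest first: the low-frequency approximation, then detail bands from the coarsest level down to the finest. A learned tokenizer would quantize these subbands into discrete tokens; this sketch only illustrates the sequence ordering.

```python
# Hypothetical sketch: one-level 2D Haar DWT (averaging convention) and a
# coarse-to-fine subband sequence, illustrating the token ordering described
# in the abstract. Pure Python for clarity, not efficiency.

def haar_dwt_2d(img):
    """One-level 2D Haar transform; img is a list of lists with even dims.

    Returns the four subbands (LL, LH, HL, HH): coarse approximation,
    horizontal, vertical, and diagonal detail.
    """
    h, w = len(img), len(img[0])
    # Transform rows: lowpass (pairwise average) and highpass (difference).
    rows = []
    for r in img:
        lo = [(r[2 * i] + r[2 * i + 1]) / 2 for i in range(w // 2)]
        hi = [(r[2 * i] - r[2 * i + 1]) / 2 for i in range(w // 2)]
        rows.append(lo + hi)
    # Transform columns the same way.
    out = [[0.0] * w for _ in range(h)]
    for c in range(w):
        col = [rows[i][c] for i in range(h)]
        lo = [(col[2 * i] + col[2 * i + 1]) / 2 for i in range(h // 2)]
        hi = [(col[2 * i] - col[2 * i + 1]) / 2 for i in range(h // 2)]
        for i, v in enumerate(lo + hi):
            out[i][c] = v
    hh, hw = h // 2, w // 2
    LL = [row[:hw] for row in out[:hh]]
    LH = [row[hw:] for row in out[:hh]]
    HL = [row[:hw] for row in out[hh:]]
    HH = [row[hw:] for row in out[hh:]]
    return LL, LH, HL, HH

def coarse_to_fine_sequence(img, levels):
    """Recursively decompose the LL band; return subbands coarsest-first."""
    bands = []
    ll = img
    for _ in range(levels):
        ll, lh, hl, hh = haar_dwt_2d(ll)
        bands.append((lh, hl, hh))
    seq = [ll]  # the coarsest approximation leads the sequence
    for lh, hl, hh in reversed(bands):  # then details, coarse to fine
        seq.extend([lh, hl, hh])
    return seq
```

With this ordering, a prefix of the sequence is already a coarse reconstruction of the whole image (enabling partial decoding), and generating the remaining fine-detail bands conditioned on the prefix is exactly the upsampling use case the abstract describes.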