In this paper, we take a new approach to autoregressive image generation based on two main ingredients. The first is wavelet image coding, which allows us to tokenize the visual content of an image from coarse to fine by ordering the information starting with the most significant bits of the most significant wavelet coefficients. The second is a variant of a language transformer whose architecture is redesigned and optimized for token sequences in this 'wavelet language'. The transformer learns the significant statistical correlations within a token sequence, which are manifestations of the well-known correlations between wavelet subbands at different resolutions. We present experimental results for conditional generation.
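To make the coarse-to-fine ordering concrete, the following is a minimal sketch and not the tokenizer used in this work. It assumes a 1D signal instead of a 2D image, an integer Haar transform (S-transform) as the wavelet, and a plain bit-plane scan with a hypothetical four-symbol vocabulary (`0`, `1`, `POS`, `NEG`). All function names are illustrative.

```python
def haar_forward(signal, levels):
    """Integer Haar (S-transform); returns coefficients ordered coarse to fine."""
    approx, details = list(signal), []
    for _ in range(levels):
        s, d = [], []
        for a, b in zip(approx[0::2], approx[1::2]):
            diff = a - b
            d.append(diff)               # detail (high-pass) coefficient
            s.append(b + (diff >> 1))    # approximation, floor((a + b) / 2)
        details.append(d)
        approx = s
    coeffs = list(approx)                # coarsest approximation first
    for d in reversed(details):          # then details, coarsest to finest
        coeffs.extend(d)
    return coeffs


def haar_inverse(coeffs, levels):
    """Exact inverse of haar_forward."""
    n = len(coeffs) >> levels
    approx, pos = list(coeffs[:n]), n
    for _ in range(levels):
        d = coeffs[pos:pos + len(approx)]
        pos += len(approx)
        nxt = []
        for s, diff in zip(approx, d):
            b = s - (diff >> 1)
            nxt.extend([b + diff, b])
        approx = nxt
    return approx


def tokenize(coeffs):
    """Emit one token per coefficient per bit plane, most significant plane first."""
    top = max(abs(c) for c in coeffs).bit_length() - 1
    significant = [False] * len(coeffs)
    tokens = []
    for plane in range(top, -1, -1):
        for i, c in enumerate(coeffs):   # coarse-to-fine scan within a plane
            bit = (abs(c) >> plane) & 1
            if bit and not significant[i]:
                significant[i] = True    # first nonzero bit also carries the sign
                tokens.append("POS" if c > 0 else "NEG")
            else:
                tokens.append(str(bit))
    return top, tokens


def detokenize(tokens, top, n):
    """Rebuild coefficients from any prefix of the token sequence."""
    mags, signs = [0] * n, [1] * n
    for k, tok in enumerate(tokens):
        plane, i = top - k // n, k % n
        if tok in ("POS", "NEG"):
            signs[i] = 1 if tok == "POS" else -1
        if tok != "0":
            mags[i] |= 1 << plane
    return [s * m for s, m in zip(signs, mags)]


signal = [10, 12, 14, 200, 198, 50, 52, 54]
coeffs = haar_forward(signal, levels=2)
top, tokens = tokenize(coeffs)
# The full sequence is lossless; a prefix yields a coarser approximation.
full = haar_inverse(detokenize(tokens, top, len(coeffs)), levels=2)
coarse = haar_inverse(detokenize(tokens[: len(tokens) // 2], top, len(coeffs)), levels=2)
```

Because every prefix of the token sequence decodes to a valid approximation, an autoregressive model over such tokens would refine its output progressively as more tokens are generated. A practical image coder would operate on 2D subbands and would typically use a more compact significance coding than this plain scan.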