ENAT: Rethinking Spatial-temporal Interactions in Token-based Image Synthesis

Recently, token-based generation has demonstrated its effectiveness in image synthesis. As a representative example, non-autoregressive Transformers (NATs) can generate decent-quality images in a few steps. NATs perform generation progressively: the latent tokens of the resulting image are revealed incrementally. At each step, the unrevealed image regions are padded with mask tokens and inferred by the NAT. In this paper, we delve into the mechanisms behind the effectiveness of NATs and uncover two important patterns that emerge naturally from them. Spatially (within a step), although mask and visible tokens are processed uniformly by NATs, the interactions between them are highly asymmetric. Specifically, mask tokens mainly gather information for decoding, while visible tokens primarily provide information, and their deep representations can be built upon themselves alone. Temporally (across steps), the interactions between adjacent generation steps mostly concentrate on updating the representations of a few critical tokens, while the computation for the majority of tokens is largely repetitive. Driven by these findings, we propose EfficientNAT (ENAT), a NAT model that explicitly encourages these critical interactions inherent in NATs. At the spatial level, we disentangle the computations of visible and mask tokens by encoding visible tokens independently, while decoding mask tokens conditioned on the fully encoded visible tokens. At the temporal level, we prioritize the computation of the critical tokens at each step, while maximally reusing previously computed token representations to supplement the necessary information. ENAT notably improves the performance of NATs while significantly reducing computational cost. Experiments on ImageNet-256, ImageNet-512 and MS-COCO validate the effectiveness of ENAT. Code is available at https://github.com/LeapLabTHU/ENAT.
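To make the generation procedure described above more concrete, the following is a minimal toy sketch in pure Python. It is not the ENAT implementation. The function names (`encode_visible`, `decode_masked`, `generate`), the confidence-based choice of which tokens to reveal, the even reveal schedule, and the treatment of newly revealed tokens as the "critical" ones are all assumptions made for illustration. The "encoder" and "decoder" are arithmetic stand-ins rather than Transformers. The sketch only shows the control flow: visible tokens are encoded on their own, mask tokens are decoded conditioned on the encoded visible tokens, and representations computed in earlier steps are reused from a cache.

```python
import random

MASK = -1  # placeholder id for an unrevealed (masked) token


def encode_visible(tokens, visible_positions, cache):
    """Encode visible tokens independently of the mask tokens.

    Toy stand-in for an encoder: the "feature" is just the token id as a float.
    Positions already in the cache are reused, not recomputed (assumed reuse scheme).
    Returns how many positions were newly encoded in this call.
    """
    newly_encoded = 0
    for pos in visible_positions:
        if pos not in cache:
            cache[pos] = float(tokens[pos])
            newly_encoded += 1
    return newly_encoded


def decode_masked(masked_positions, cache, vocab_size, rng):
    """Predict a (token, confidence) pair for every masked position.

    Toy stand-in for a decoder conditioned on the encoded visible tokens.
    The confidence is random here; a real model would derive it from its logits.
    """
    context = sum(cache.values())  # toy summary of all visible features
    predictions = {}
    for pos in masked_positions:
        token = int(context + pos) % vocab_size
        predictions[pos] = (token, rng.random())
    return predictions


def generate(num_tokens=16, steps=4, vocab_size=8, seed=0):
    """Progressively reveal tokens over a fixed number of steps."""
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens
    cache = {}               # position -> reusable representation
    encoded_per_step = []    # how much new encoding work each step needed

    for step in range(steps):
        visible = [i for i, t in enumerate(tokens) if t != MASK]
        masked = [i for i, t in enumerate(tokens) if t == MASK]

        # Spatial level: visible tokens are encoded without looking at mask tokens.
        # Temporal level: only tokens revealed since the last step are encoded.
        encoded_per_step.append(encode_visible(tokens, visible, cache))

        predictions = decode_masked(masked, cache, vocab_size, rng)

        # Assumed schedule: spread the remaining tokens evenly over the remaining
        # steps (ceiling division), so the last step reveals everything left.
        remaining_steps = steps - step
        num_to_reveal = -(-len(masked) // remaining_steps)

        # Assumed selection rule: reveal the most confident predictions first.
        ranked = sorted(masked, key=lambda p: predictions[p][1], reverse=True)
        for pos in ranked[:num_to_reveal]:
            tokens[pos] = predictions[pos][0]

    return tokens, encoded_per_step


if __name__ == "__main__":
    final_tokens, work = generate()
    print(final_tokens)
    print(work)
```

With the default arguments, every step after the first encodes only the 4 tokens revealed in the previous step, so `encoded_per_step` is `[0, 4, 4, 4]`. The full visible set is never re-encoded. This is the kind of saving the temporal reuse idea targets, shown here in a heavily simplified form.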
