Recent autoregressive (AR) models that generate images from continuous tokens show promising results by eliminating the need for discrete tokenization. However, these models are inefficient for two reasons: they generate tokens sequentially, and they rely on computationally intensive diffusion-based sampling. We present ECAR (Efficient Continuous Auto-Regressive Image Generation via Multistage Modeling), an approach that addresses these limitations through two intertwined innovations: (1) a stage-wise continuous token generation strategy that reduces computational complexity and provides progressively refined token maps as hierarchical conditions, and (2) a multistage flow-based distribution modeling method that transforms only partially denoised distributions at each stage, in contrast to the complete denoising performed by standard diffusion models. Overall, ECAR generates tokens at increasing resolutions while simultaneously denoising the image at each stage. This design reduces the token-to-image transformation cost by a factor equal to the number of stages and also enables parallel processing at the token level. Beyond improving computational efficiency, the approach aligns naturally with image generation principles: it operates in a continuous token space and follows a hierarchical, coarse-to-fine generation process. Experimental results demonstrate that ECAR achieves image quality comparable to DiT [Peebles & Xie, 2023] while requiring 10$\times$ fewer FLOPs and running 5$\times$ faster when generating a 256$\times$256 image.