Recent autoregressive (AR) models that operate on continuous tokens show promising results for image generation by eliminating the need for discrete tokenization. However, these models are inefficient: they generate tokens sequentially and rely on computationally intensive diffusion-based sampling. We present ECAR (Efficient Continuous Auto-Regressive Image Generation via Multistage Modeling), an approach that addresses these limitations through two intertwined innovations: (1) a stage-wise continuous token generation strategy that reduces computational complexity and provides progressively refined token maps as hierarchical conditions, and (2) a multistage flow-based distribution modeling method that transforms only partially denoised distributions at each stage, compared to the complete denoising performed in standard diffusion models. Holistically, ECAR generates tokens at increasing resolutions while simultaneously denoising the image at each stage. This design reduces the token-to-image transformation cost by a factor equal to the number of stages and enables parallel processing at the token level. Beyond computational efficiency, the approach aligns naturally with image generation principles by operating in continuous token space and following a hierarchical, coarse-to-fine generation process. Experimental results demonstrate that ECAR achieves image quality comparable to DiT (Peebles & Xie, 2023) while requiring 10$\times$ fewer FLOPs and running 5$\times$ faster when generating a 256$\times$256 image.
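The cost argument in the abstract can be illustrated with a small counting sketch. It is not the ECAR implementation: the names `Stage`, `make_schedule`, and `denoise_cost` are hypothetical, and the sketch assumes (without confirmation from the text) that the token map side doubles at every stage, that the flow time interval [0, 1] is split evenly across stages, and that cost is proportional to tokens times denoising steps. The numbers it produces are illustrative only and are not the reported 10$\times$ FLOPs or 5$\times$ speed figures.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    side: int       # token map is side x side at this stage
    t_start: float  # start of the flow-time interval handled by this stage
    t_end: float    # end of the flow-time interval handled by this stage
    steps: int      # denoising steps spent in this stage


def make_schedule(num_stages: int, total_steps: int, final_side: int) -> list:
    """Hypothetical schedule: resolution doubles per stage (assumption),
    and each stage covers an equal slice of the trajectory (assumption)."""
    stages = []
    for k in range(num_stages):
        side = final_side // 2 ** (num_stages - 1 - k)
        stages.append(
            Stage(
                side=side,
                t_start=k / num_stages,
                t_end=(k + 1) / num_stages,
                steps=total_steps // num_stages,
            )
        )
    return stages


def denoise_cost(stages) -> int:
    """Toy cost model: one unit per token per denoising step."""
    return sum(s.side * s.side * s.steps for s in stages)


# Example: 4 stages, 100 total steps, 16 x 16 final token map.
schedule = make_schedule(num_stages=4, total_steps=100, final_side=16)
multistage_cost = denoise_cost(schedule)
# Baseline: every step denoises the full-resolution token map.
single_stage_cost = 16 * 16 * 100
```

Under these assumptions, the final stage runs only a quarter of the total steps at full resolution, which is the "factor of the number of stages" reduction the abstract refers to; the earlier stages add a smaller cost because they operate on fewer tokens.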