Vector-quantized image modeling has shown great potential for synthesizing high-quality images. However, generating high-resolution images remains challenging because the computational cost of self-attention grows quadratically with the number of tokens. In this study, we explore a more efficient two-stage framework for high-resolution image generation, with improvements in three aspects. (1) Based on the observation that the first (quantization) stage is strongly local in nature, we employ a quantization model based on local attention instead of the global attention mechanism used in previous methods, which improves both efficiency and reconstruction quality. (2) We emphasize the importance of multi-grained feature interaction during image generation and introduce an efficient attention mechanism that combines global attention (for long-range semantic consistency across the whole image) and local attention (for fine-grained details). This approach yields faster generation, higher generation fidelity, and higher output resolution. (3) We propose a new generation pipeline that combines autoencoding training with an autoregressive generation strategy, which we find to be a better paradigm for image synthesis. Extensive experiments demonstrate the superiority of our approach in high-quality, high-resolution image reconstruction and generation.
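To make the quadratic-cost argument concrete, the following minimal sketch counts query-key pairs for global attention and for a simple windowed form of local attention over a grid of image tokens. It is only an illustration: the non-overlapping window partition, the window size of 8, and the function names `global_attention_pairs` and `windowed_attention_pairs` are assumptions made here, not the attention design of the proposed framework.

```python
def global_attention_pairs(height: int, width: int) -> int:
    """Query-key pairs when every token attends to every other token."""
    tokens = height * width
    return tokens * tokens


def windowed_attention_pairs(height: int, width: int, window: int) -> int:
    """Query-key pairs when tokens attend only inside their own
    non-overlapping window x window block (an assumed form of local attention)."""
    if height % window or width % window:
        raise ValueError("grid must be divisible by the window size")
    num_windows = (height // window) * (width // window)
    tokens_per_window = window * window
    return num_windows * tokens_per_window * tokens_per_window


if __name__ == "__main__":
    # Compare costs as the token grid grows; the window size stays fixed.
    for side in (16, 32, 64):
        g = global_attention_pairs(side, side)
        l = windowed_attention_pairs(side, side, window=8)
        print(f"{side}x{side} tokens: global={g}, windowed={l}, ratio={g // l}")
```

Under these assumptions the global count grows with the square of the number of tokens, while the windowed count grows only linearly, so the gap widens as the resolution increases.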