Autoregressive models have achieved significant success in image generation. However, standard autoregressive methods typically generate pixels sequentially in a fixed spatial order, ignoring the inherent hierarchical structure that image information exhibits in the spectral domain. To better leverage this spectral hierarchy, we introduce Next-Frequency Image Generation (NFIG), a novel framework that decomposes the image generation process into multiple frequency-guided stages. NFIG aligns generation with the natural structure of images: it first generates low-frequency components, which capture global structure with significantly fewer tokens, and then progressively adds higher-frequency details. This frequency-aware paradigm offers substantial advantages: it not only improves the quality of generated images but, crucially, reduces inference cost by establishing global structure efficiently early on. Extensive experiments on the ImageNet-256 benchmark validate NFIG's effectiveness, demonstrating superior performance (FID: 2.81) and a notable 1.25x speedup over the strong baseline VAR-d20. The source code is available at https://github.com/Pride-Huang/NFIG.
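The core intuition, that low-frequency components carry an image's global structure while high frequencies hold fine detail, can be illustrated with a simple FFT-based band split. This is only a minimal sketch of the frequency-decomposition idea, not the actual NFIG pipeline; the `cutoff` parameter and the circular low-pass mask are illustrative assumptions.

```python
import numpy as np

def frequency_split(image, cutoff=0.25):
    """Split a grayscale image into low- and high-frequency components.

    A centered 2D FFT is masked with a circular low-pass filter; the
    complement gives the high-frequency residual. Summing the two parts
    reconstructs the original image, so the split is lossless.
    """
    h, w = image.shape
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    # Circular low-pass mask around the spectrum center.
    yy, xx = np.ogrid[:h, :w]
    radius = np.hypot(yy - h / 2, xx - w / 2)
    mask = radius <= cutoff * min(h, w) / 2
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(spectrum * ~mask)).real
    return low, high

# Toy example: a smooth gradient (global structure) plus a fine
# checkerboard texture (high-frequency detail).
h = w = 64
base = np.linspace(0, 1, w)[None, :] * np.ones((h, 1))
texture = 0.1 * (np.indices((h, w)).sum(axis=0) % 2)
img = base + texture

low, high = frequency_split(img)
# The low-pass output tracks the smooth gradient; the residual
# carries the checkerboard texture.
print(np.allclose(low + high, img))
```

In a coarse-to-fine generative scheme like the one the abstract describes, the low band would be modeled first with few tokens, and later stages would condition on it to fill in the residual detail.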