FlexVAR: Flexible Visual Autoregressive Modeling without Residual Prediction

This work challenges the residual prediction paradigm in visual autoregressive modeling and presents FlexVAR, a new Flexible Visual AutoRegressive image generation paradigm. FlexVAR trains the autoregressive model to predict the ground truth at each step rather than a residual, so each step can independently produce a plausible image. This simple, intuitive approach learns visual distributions quickly and makes the generation process more flexible and adaptable. Trained solely on low-resolution images ($\leq$ 256px), FlexVAR can: (1) generate images of various resolutions and aspect ratios, even exceeding the resolution of the training images; (2) support various image-to-image tasks, including image refinement, in/out-painting, and image expansion; and (3) adapt to different numbers of autoregressive steps, allowing faster inference with fewer steps or higher image quality with more steps. Our 1.0B model outperforms its VAR counterpart on the ImageNet 256$\times$256 benchmark. Moreover, when the generation process is transferred zero-shot to 13 steps, performance improves further to 2.08 FID, outperforming the state-of-the-art autoregressive models AiM/VAR by 0.25/0.28 FID and the popular diffusion models LDM/DiT by 1.52/0.19 FID, respectively. When our 1.0B model is transferred zero-shot to the ImageNet 512$\times$512 benchmark, FlexVAR achieves results competitive with the VAR 2.3B model, which is fully supervised and trained at 512$\times$512 resolution.

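The contrast between residual prediction and ground-truth prediction can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the abstract: it works on a raw 2D array instead of tokenizer latents, the scale schedule `[1, 2, 4, 8]` is hypothetical, and average pooling and nearest-neighbour upsampling stand in for whatever resampling the actual models use. The function names `residual_targets` and `ground_truth_targets` are likewise hypothetical.

```python
import numpy as np


def downsample(x, size):
    """Average-pool a square (H, H) array to (size, size); H must be divisible by size."""
    factor = x.shape[0] // size
    return x.reshape(size, factor, size, factor).mean(axis=(1, 3))


def upsample(x, size):
    """Nearest-neighbour upsample a square array to (size, size)."""
    factor = size // x.shape[0]
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)


def residual_targets(x, scales):
    """Residual-style targets: each scale encodes only what coarser scales left unexplained."""
    full = x.shape[0]
    remaining = x.copy()
    targets = []
    for size in scales:
        target = downsample(remaining, size)
        targets.append(target)
        # Subtract this scale's contribution so the next scale sees the leftover.
        remaining = remaining - upsample(target, full)
    return targets


def ground_truth_targets(x, scales):
    """Ground-truth-style targets: each scale is a downsampled copy of the input itself."""
    return [downsample(x, size) for size in scales]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((8, 8))   # toy stand-in for an image or latent map
    scales = [1, 2, 4, 8]        # hypothetical scale schedule

    res = residual_targets(image, scales)
    gt = ground_truth_targets(image, scales)

    # Residual targets only reconstruct the input when all scales are summed.
    recon = sum(upsample(t, 8) for t in res)
    print("residual sum matches input:", np.allclose(recon, image))

    # Each ground-truth target is a coarse version of the input on its own.
    print("ground-truth means:", [round(float(t.mean()), 4) for t in gt])
    print("residual means:    ", [round(float(t.mean()), 4) for t in res])
```

In this toy setting, every ground-truth target keeps the mean of the input, so each scale is a usable coarse image by itself, whereas the residual targets after the first are zero-mean corrections that are only meaningful once all coarser scales are summed. That dependence on a fixed chain of scales is one plausible reading of why dropping residuals makes the number of steps and the output resolution easier to vary.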