Autoregressive (AR) image generators offer a language-model-friendly approach to image generation by predicting discrete image tokens in a causal sequence. However, unlike diffusion models, AR models lack a mechanism to refine previous predictions, limiting their generation quality. In this paper, we introduce TensorAR, a new AR paradigm that reformulates image generation from next-token prediction to next-tensor prediction. By generating overlapping windows of image patches (tensors) in a sliding fashion, TensorAR enables iterative refinement of previously generated content. To prevent information leakage during training, we propose a discrete tensor noising scheme, which perturbs input tokens via codebook-indexed noise. TensorAR is implemented as a plug-and-play module compatible with existing AR models. Extensive experiments on LlamaGEN, Open-MAGVIT2, and RAR demonstrate that TensorAR significantly improves the generation performance of autoregressive models.
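To make the sliding-window idea concrete, the following is a minimal sketch, not the paper's implementation. It assumes a flattened 1-D token sequence, a window length `k`, a padding index of `-1`, and uniform noise over codebook indices; the abstract specifies none of these. The names `sliding_windows`, `noise_tokens`, `generate`, and `predict_window` are hypothetical, and `predict_window` stands in for the AR model.

```python
import random

def sliding_windows(tokens, k, pad=-1):
    """Return one length-k target window per position.

    Consecutive windows overlap by k-1 tokens; the tail is padded
    with `pad` (an assumed convention, not taken from the paper).
    """
    padded = list(tokens) + [pad] * (k - 1)
    return [padded[i:i + k] for i in range(len(tokens))]

def noise_tokens(tokens, codebook_size, p, rng):
    """Replace each token, with probability p, by a random codebook index.

    Uniform sampling is an assumption; the abstract only says the noise
    is codebook-indexed.
    """
    return [rng.randrange(codebook_size) if rng.random() < p else t
            for t in tokens]

def generate(predict_window, length, k):
    """Toy decoding loop showing how overlap permits refinement.

    Step i proposes tokens for positions i..i+k-1 and overwrites any
    earlier proposal for those positions. `writes` counts how often
    each position was (re)written.
    """
    canvas = [None] * length
    writes = [0] * length
    for i in range(length):
        window = predict_window(canvas[:i], k)  # stub for the AR model
        for j, tok in enumerate(window):
            if i + j < length:
                canvas[i + j] = tok
                writes[i + j] += 1
    return canvas, writes

if __name__ == "__main__":
    print(sliding_windows([1, 2, 3, 4], 2))
    rng = random.Random(0)
    print(noise_tokens([5, 5, 5, 5], codebook_size=16, p=0.5, rng=rng))
    # Stub predictor: proposes the position index itself for each slot.
    print(generate(lambda ctx, k: [len(ctx) + j for j in range(k)], 5, 3))
```

In this toy loop every position past the first few is written `k` times, which is the sense in which overlapping windows let later steps revise earlier output. The noising function illustrates why perturbation is needed at training time: without it, the input window would already contain most of the target window.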