Autoregressive (AR) modeling, known for its next-token prediction paradigm, underpins state-of-the-art language and visual generative models. Traditionally, a ``token'' is treated as the smallest prediction unit, often a discrete symbol in language or a quantized patch in vision. However, the optimal token definition for 2D image structures remains an open question. Moreover, AR models suffer from exposure bias, where teacher forcing during training leads to error accumulation at inference. In this paper, we propose xAR, a generalized AR framework that extends the notion of a token to an entity X, which can represent an individual patch token, a cell (a $k\times k$ grouping of neighboring patches), a subsample (a non-local grouping of distant patches), a scale (coarse-to-fine resolution), or even a whole image. Additionally, we reformulate discrete token classification as \textbf{continuous entity regression}, leveraging flow-matching methods at each AR step. This approach conditions training on noisy entities instead of ground-truth tokens, leading to Noisy Context Learning, which effectively alleviates exposure bias. As a result, xAR offers two key advantages: (1) it enables flexible prediction units that capture different contextual granularities and spatial structures, and (2) it mitigates exposure bias by avoiding reliance on teacher forcing. On the ImageNet-256 generation benchmark, our base model, xAR-B (172M), outperforms DiT-XL/SiT-XL (675M) while achieving 20$\times$ faster inference. Meanwhile, xAR-H sets a new state of the art with an FID of 1.24, running 2.2$\times$ faster than the previous best-performing model, without relying on vision foundation modules (\eg, DINOv2) or advanced guidance interval sampling.
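As a concrete reading of the continuous entity regression described above, the per-step training objective can be sketched with the standard flow-matching loss. The notation below is illustrative rather than taken from the paper: $x_1$ denotes a ground-truth entity, $x_0$ Gaussian noise, and $c$ the (noisy) previously generated entities that condition the current AR step:

```latex
% Linear interpolant between noise x_0 ~ N(0, I) and a ground-truth entity x_1,
% with time t drawn uniformly from [0, 1]:
%   x_t = t x_1 + (1 - t) x_0
% The velocity network v_theta regresses the constant target velocity (x_1 - x_0),
% conditioned on the noisy context c (Noisy Context Learning):
\mathcal{L}_{\mathrm{FM}}
  = \mathbb{E}_{t,\, x_0,\, x_1}
    \bigl\| v_\theta(x_t,\, t,\, c) - (x_1 - x_0) \bigr\|^2,
\qquad x_t = t\, x_1 + (1 - t)\, x_0 .
```

Under this sketch, conditioning $v_\theta$ on noisy entities $c$ rather than ground-truth tokens is what distinguishes the training setup from standard teacher forcing, since the model never sees clean context it would lack at inference time.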