Conventional wisdom holds that autoregressive models for image generation are typically accompanied by vector-quantized tokens. We observe that while a discrete-valued space can facilitate representing a categorical distribution, it is not a necessity for autoregressive modeling. In this work, we propose to model the per-token probability distribution using a diffusion procedure, which allows us to apply autoregressive models in a continuous-valued space. Rather than using categorical cross-entropy loss, we define a Diffusion Loss function to model the per-token probability. This approach eliminates the need for discrete-valued tokenizers. We evaluate its effectiveness across a wide range of cases, including standard autoregressive models and generalized masked autoregressive (MAR) variants. By removing vector quantization, our image generator achieves strong results while enjoying the speed advantage of sequence modeling. We hope this work will motivate the use of autoregressive generation in other continuous-valued domains and applications. Code is available at: https://github.com/LTH14/mar.
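The core idea above, replacing the categorical cross-entropy over discrete tokens with a per-token denoising objective on continuous tokens, can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: the `DiffusionLoss` module, the linear noise schedule, the naive timestep embedding, and the small MLP denoiser are all simplifying assumptions made here for clarity.

```python
import torch
import torch.nn as nn


class DiffusionLoss(nn.Module):
    """Hypothetical sketch of a per-token diffusion loss for continuous tokens.

    The autoregressive model produces a conditioning vector c for each
    position; instead of a softmax over a codebook, a small denoising
    network models p(token | c) and is trained with an epsilon-prediction
    MSE objective (standard DDPM-style training, simplified here).
    """

    def __init__(self, token_dim, cond_dim, hidden=256, num_steps=1000):
        super().__init__()
        self.num_steps = num_steps
        # Linear beta schedule (an assumption; practical systems often use
        # cosine schedules).
        betas = torch.linspace(1e-4, 0.02, num_steps)
        self.register_buffer("alphas_bar", torch.cumprod(1.0 - betas, dim=0))
        # Small MLP that predicts the noise added to a token, conditioned on
        # the AR model's output vector c and a (naive) timestep embedding.
        self.net = nn.Sequential(
            nn.Linear(token_dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, token_dim),
        )

    def forward(self, x, c):
        # x: (B, token_dim) ground-truth continuous tokens
        # c: (B, cond_dim) conditioning from the autoregressive backbone
        B = x.shape[0]
        t = torch.randint(0, self.num_steps, (B,), device=x.device)
        a_bar = self.alphas_bar[t].unsqueeze(-1)            # (B, 1)
        eps = torch.randn_like(x)
        x_t = a_bar.sqrt() * x + (1 - a_bar).sqrt() * eps   # forward-noised token
        t_emb = t.float().unsqueeze(-1) / self.num_steps    # crude timestep embedding
        eps_pred = self.net(torch.cat([x_t, c, t_emb], dim=-1))
        return ((eps_pred - eps) ** 2).mean()               # denoising MSE
```

At sampling time, the same small network would be run as a reverse-diffusion sampler per token, conditioned on `c`, so generation remains sequential over tokens while each token's value is drawn from a continuous distribution.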