We present NextFlow, a unified decoder-only autoregressive transformer trained on 6 trillion interleaved text-image discrete tokens. By leveraging a unified vision representation within a single autoregressive architecture, NextFlow natively supports both multimodal understanding and generation, unlocking image editing, interleaved content generation, and video generation. Motivated by the distinct nature of the two modalities (text is strictly sequential, while images are inherently hierarchical), we retain next-token prediction for text but adopt next-scale prediction for visual generation. This departs from traditional raster-scan methods and enables the generation of 1024x1024 images in just 5 seconds, orders of magnitude faster than comparable autoregressive models. We address the instabilities of multi-scale generation through a robust training recipe, and we further introduce a prefix-tuning strategy for reinforcement learning. Experiments demonstrate that NextFlow achieves state-of-the-art performance among unified models and rivals specialized diffusion baselines in visual quality.
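To see why next-scale prediction yields such a speedup, consider the number of sequential decoding steps rather than the total token count. The back-of-the-envelope sketch below assumes a 16x-downsampling VQ tokenizer and a hypothetical scale schedule (neither is specified in the abstract); the key point is that raster-scan decoding needs one forward pass per token, whereas next-scale decoding needs one pass per scale, with all tokens inside a scale predicted in parallel.

```python
# Minimal sketch of the sequential-step arithmetic behind next-scale
# prediction. The tokenizer stride (16x) and the scale schedule below
# are illustrative assumptions, not values from the paper.

latent_side = 1024 // 16           # 64x64 token grid for a 1024px image
raster_steps = latent_side ** 2    # raster-scan: one forward pass per token
print(f"raster-scan sequential steps: {raster_steps}")  # 4096

# Hypothetical coarse-to-fine schedule: each entry is the side length of
# one token map; every token within a scale is emitted in a single
# parallel step, so sequential cost equals the number of scales.
scales = [1, 2, 3, 4, 6, 9, 13, 18, 24, 32, 48, 64]
total_tokens = sum(s * s for s in scales)
print(f"next-scale sequential steps: {len(scales)} "
      f"({total_tokens} tokens decoded scale-by-scale)")
```

Under these assumptions the sequential depth drops from 4096 steps to about a dozen, which is consistent with the claimed orders-of-magnitude latency gap over raster-scan autoregressive models.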