We present Infinity, a Bitwise Visual AutoRegressive Modeling approach capable of generating high-resolution, photorealistic images from language instructions. Infinity redefines the visual autoregressive model under a bitwise token prediction framework with an infinite-vocabulary tokenizer and classifier and a bitwise self-correction mechanism, markedly improving generation capacity and detail. By theoretically scaling the tokenizer vocabulary size to infinity while concurrently scaling the transformer size, our method unlocks substantially stronger scaling capabilities than vanilla VAR. Infinity sets a new record for autoregressive text-to-image models and outperforms top-tier diffusion models such as SD3-Medium and SDXL. Notably, Infinity surpasses SD3-Medium by improving the GenEval benchmark score from 0.62 to 0.73 and the ImageReward benchmark score from 0.87 to 0.96, achieving a win rate of 66%. Without extra optimization, Infinity generates a high-quality 1024x1024 image in 0.8 seconds, making it 2.6x faster than SD3-Medium and establishing it as the fastest text-to-image model. Models and code will be released to promote further exploration of Infinity for visual generation and unified tokenizer modeling.
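To give a rough sense of why predicting bits rather than vocabulary indices lets the vocabulary grow so large, the following minimal sketch compares the output-layer size of a conventional index classifier over 2^d entries with that of d independent binary classifiers. It is an illustration under stated assumptions, not the authors' implementation: the function names, the hidden size of 2048, the choice of d = 32 bits, and the use of two logits per bit are all hypothetical values chosen for the arithmetic.

```python
from __future__ import annotations


def index_classifier_params(hidden_dim: int, bits: int) -> int:
    """Weights in one linear layer mapping hidden_dim -> 2**bits logits (bias ignored)."""
    return hidden_dim * (2 ** bits)


def bitwise_classifier_params(hidden_dim: int, bits: int) -> int:
    """Weights in one linear layer mapping hidden_dim -> 2 logits per bit (bias ignored)."""
    return hidden_dim * 2 * bits


def bits_to_index(bit_values: list[int]) -> int:
    """Read a list of 0/1 values (most significant bit first) as a vocabulary index."""
    index = 0
    for b in bit_values:
        if b not in (0, 1):
            raise ValueError("bits must be 0 or 1")
        index = (index << 1) | b
    return index


if __name__ == "__main__":
    hidden_dim, bits = 2048, 32  # hypothetical sizes, not taken from the paper
    print("index classifier weights:  ", index_classifier_params(hidden_dim, bits))
    print("bitwise classifier weights:", bitwise_classifier_params(hidden_dim, bits))
    print("bits [1, 0, 1, 1] as index:", bits_to_index([1, 0, 1, 1]))
```

Under these assumptions the index classifier grows exponentially in the number of bits while the bitwise classifier grows linearly, which is the intuition behind scaling the vocabulary "to infinity" in the abstract.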