We present BitDance, a scalable autoregressive (AR) image generator that predicts binary visual tokens instead of codebook indices. With high-entropy binary latents, each token can represent up to $2^{256}$ states, yielding a compact yet highly expressive discrete representation. Sampling from such a huge token space is intractable with a standard softmax classifier. BitDance therefore uses a binary diffusion head: rather than predicting an index over a codebook, it generates binary tokens via continuous-space diffusion. Furthermore, we propose next-patch diffusion, a new decoding method that predicts multiple tokens in parallel with high accuracy, greatly accelerating inference. On ImageNet 256x256, BitDance achieves an FID of 1.24, the best among AR models. With next-patch diffusion, BitDance surpasses state-of-the-art parallel AR models with 1.4B parameters while using 5.4x fewer parameters (260M) and delivering an 8.7x speedup. For text-to-image generation, BitDance trains on large-scale multimodal tokens and efficiently generates high-resolution, photorealistic images, showing strong performance and favorable scaling. When generating 1024x1024 images, BitDance achieves over a 30x speedup compared to prior AR models. To facilitate further research on AR foundation models, code and models are released at: https://github.com/shallowdream204/BitDance.
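The abstract describes the binary diffusion head only at a high level: each token is a 256-bit binary vector, sampled by running diffusion in continuous space and then binarizing. The toy sketch below illustrates that general idea under stated assumptions; the `toy_denoise_step` function is a hypothetical stand-in for the learned denoiser, not the actual BitDance architecture.

```python
import random

TOKEN_BITS = 256  # each binary token encodes one of 2**256 possible states

def toy_denoise_step(x, target, t):
    # Hypothetical stand-in for a learned denoiser: as t goes from 1 -> 0,
    # blend the current continuous state toward a conditioning direction.
    return [(1 - t) * tg + t * xi for xi, tg in zip(x, target)]

def sample_binary_token(target, steps=10, seed=0):
    rng = random.Random(seed)
    # Start from Gaussian noise in continuous space.
    x = [rng.gauss(0.0, 1.0) for _ in range(TOKEN_BITS)]
    # Iteratively denoise toward the conditioning signal.
    for i in range(steps, 0, -1):
        x = toy_denoise_step(x, target, i / steps)
    # Binarize at the end: each dimension becomes a +1 / -1 bit.
    return [1 if xi >= 0 else -1 for xi in x]

rng = random.Random(1)
target = [1.0 if rng.random() > 0.5 else -1.0 for _ in range(TOKEN_BITS)]
token = sample_binary_token(target)
```

The key point the sketch captures is that no softmax over $2^{256}$ classes is ever materialized: the model operates in a 256-dimensional continuous space and the discrete token emerges only from the final sign binarization.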