This paper challenges the dominance of continuous pipelines in visual generation. We systematically investigate the performance gap between discrete and continuous methods. Contrary to the belief that discrete tokenizers are intrinsically inferior, we demonstrate that the disparity arises primarily from the total number of bits allocated in the latent space (i.e., the compression ratio). We show that scaling up the codebook size effectively bridges this gap, allowing discrete tokenizers to match or surpass their continuous counterparts. However, existing discrete generation methods struggle to capitalize on this insight, suffering from performance degradation or prohibitive training costs at scaled codebook sizes. To address this, we propose masked Bit AutoRegressive modeling (BAR), a scalable framework that supports arbitrary codebook sizes. By equipping an autoregressive transformer with a masked bit modeling head, BAR predicts each discrete token by progressively generating its constituent bits. BAR achieves a new state-of-the-art gFID of 0.99 on ImageNet-256, outperforming leading methods across both continuous and discrete paradigms, while significantly reducing sampling costs and converging faster than prior continuous approaches. The project page is available at https://bar-gen.github.io/