The field of image generation is currently bifurcated into autoregressive (AR) models operating on discrete tokens and diffusion models utilizing continuous latents. This divide, rooted in the distinction between VQ-VAEs and VAEs, hinders unified modeling and fair benchmarking. Finite Scalar Quantization (FSQ) offers a theoretical bridge, yet vanilla FSQ suffers from a critical flaw: its equal-interval quantization can cause activation collapse. This mismatch between the quantizer's uniform bins and the non-uniform activation distribution forces a trade-off between reconstruction fidelity and information efficiency. In this work, we resolve this dilemma by simply replacing the activation function in the original FSQ with a distribution-matching mapping that enforces a uniform prior. Termed iFSQ, this simple strategy requires just one line of code yet mathematically guarantees both optimal bin utilization and reconstruction precision. Leveraging iFSQ as a controlled benchmark, we uncover two key insights: (1) the optimal equilibrium between discrete and continuous representations lies at approximately 4 bits per dimension; (2) under identical reconstruction constraints, AR models exhibit rapid initial convergence, whereas diffusion models achieve a superior performance ceiling, suggesting that strict sequential ordering may limit the upper bounds of generation quality. Finally, we extend our analysis by adapting Representation Alignment (REPA) to AR models, yielding LlamaGen-REPA. Code is available at https://github.com/Tencent-Hunyuan/iFSQ
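The abstract describes iFSQ only as a one-line swap of FSQ's bounding activation for a distribution-matching mapping. The sketch below illustrates the underlying idea under stated assumptions: the original FSQ bounds activations with `tanh` before equal-interval rounding, and if pre-quantizer activations are roughly Gaussian, the Gaussian CDF (via `math.erf`) maps them to a uniform distribution, so every bin is used. The function names, the choice of `erf`, and the Gaussian assumption are illustrative, not taken from the paper.

```python
import math
import numpy as np

def fsq_quantize(z, levels=15, bound=np.tanh):
    """Vanilla FSQ: squash activations to [-1, 1], then round to
    `levels` equal-interval bins (straight-through gradient omitted)."""
    half = (levels - 1) / 2
    zb = bound(z)                      # bounding activation, tanh in original FSQ
    return np.round(zb * half) / half  # equal-interval quantization

def gauss_to_uniform(z, sigma=1.0):
    """Hypothetical iFSQ-style mapping (assumption: the paper's exact mapping
    is not given in the abstract). If z ~ N(0, sigma^2), then
    erf(z / (sigma * sqrt(2))) = 2*CDF(z) - 1 is uniform on [-1, 1],
    so equal-interval bins are hit with near-equal probability."""
    return np.vectorize(math.erf)(z / (sigma * math.sqrt(2)))

def bin_usage(q, levels=15):
    """Empirical fraction of samples falling into each quantization bin."""
    half = (levels - 1) / 2
    idx = np.round((q + 1) * half).astype(int)
    return np.bincount(idx, minlength=levels) / len(q)

rng = np.random.default_rng(0)
z = rng.normal(size=100_000)  # simulated Gaussian pre-quantizer activations

tanh_usage = bin_usage(fsq_quantize(z, bound=np.tanh))
ifsq_usage = bin_usage(fsq_quantize(z, bound=gauss_to_uniform))

# tanh concentrates mass in the central bins; the CDF mapping spreads
# probability far more evenly across all bins.
print("tanh bin-usage std:", tanh_usage.std())
print("CDF  bin-usage std:", ifsq_usage.std())
```

The comparison of bin-usage standard deviations makes the "optimal bin utilization" claim concrete: the distribution-matched variant yields a near-flat histogram, while the tanh baseline leaves outer bins underused.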