Existing vector quantization (VQ) methods struggle with scalability, largely because the codebook, which receives only partial updates during training, becomes unstable. As utilization decreases, the distribution gap between non-activated codes and visual features progressively widens, and the codebook is prone to collapse. To address this problem, we propose Index Backpropagation Quantization (IBQ), a new VQ method that jointly optimizes all codebook embeddings and the visual encoder. By applying a straight-through estimator to the one-hot categorical distribution between the encoded features and the codebook, IBQ makes all codes differentiable and keeps them in a latent space consistent with the visual encoder. IBQ enables scalable training of visual tokenizers and, for the first time, achieves a large-scale codebook ($2^{18}$ codes) with high dimension ($256$) and high utilization. Experiments on the standard ImageNet benchmark demonstrate the scalability and superiority of IBQ, with competitive results on both reconstruction and autoregressive visual generation. The code and models are available at https://github.com/TencentARC/SEED-Voken.
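To make the core mechanism concrete, the following is a minimal PyTorch-style sketch of a straight-through estimator applied to the one-hot categorical distribution between encoded features and the codebook, so that gradients reach every codebook embedding. Function and tensor names are illustrative assumptions, not the released implementation.

```python
# Illustrative sketch of an IBQ-style quantization step (hypothetical names/shapes;
# see https://github.com/TencentARC/SEED-Voken for the official implementation).
import torch
import torch.nn.functional as F


def ibq_quantize(z, codebook):
    """Quantize encoder features while letting gradients reach all codebook entries.

    z:        (batch, dim) encoded visual features
    codebook: (num_codes, dim) learnable embedding table
    """
    # Similarity logits between each feature and every code.
    logits = z @ codebook.t()                              # (batch, num_codes)

    # Soft categorical distribution over the whole codebook (differentiable
    # with respect to all codebook embeddings).
    soft_one_hot = F.softmax(logits, dim=-1)

    # Hard one-hot selection of the most similar code.
    indices = logits.argmax(dim=-1)                        # (batch,)
    hard_one_hot = F.one_hot(indices, codebook.size(0)).type_as(soft_one_hot)

    # Straight-through estimator on the one-hot categorical distribution:
    # the forward pass uses the hard one-hot, the backward pass uses the soft
    # probabilities, so every codebook embedding receives gradient signal.
    one_hot = hard_one_hot + soft_one_hot - soft_one_hot.detach()

    z_q = one_hot @ codebook                               # (batch, dim) quantized features
    return z_q, indices


# Example usage with an (assumed) large codebook of 2**18 codes of dimension 256.
codebook = torch.nn.Parameter(torch.randn(2 ** 18, 256) * 0.02)
z = torch.randn(8, 256)
z_q, idx = ibq_quantize(z, codebook)
```

Because the backward path runs through the soft distribution over the entire codebook rather than only the selected entry, non-activated codes are also pulled toward the encoder's feature distribution, which is the property the abstract credits for avoiding codebook collapse.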