Vector quantization (VQ) is central to modern generative modeling pipelines, but VQ models with large codebooks often suffer from codebook collapse. We identify encoder drift as a key driver of this failure: as the encoder shifts the latent distribution, sparsely updated code vectors can lag behind, lose assignments, and increase quantization error, creating a feedback loop through the straight-through estimator. We propose NSVQ, a non-stationary-aware VQ training strategy that combines a dense non-stationary embedding loss, codebook replacement, and stage-wise encoder freezing. NSVQ first helps the codebook track encoder drift during early training, then freezes the encoder to consolidate the codebook under a fixed latent geometry, and finally reintroduces adversarial refinement. On ImageNet-1k at 128$\times$128 with 65,536 codes, NSVQ improves reconstruction quality, achieving an rFID of 2.10 versus 2.39 for SimVQ, while both methods maintain 100\% codebook utilization. Additional latent diffusion experiments show that NSVQ also improves downstream ImageNet generation FID.
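To make the failure mode concrete, the following toy sketch shows how a codebook left behind by a drifted latent distribution loses utilization, and how re-seeding unassigned codes recovers it. It is not the NSVQ implementation: the abstract does not specify the replacement rule or the embedding loss, so the "move each dead code to the worst-quantized latent" rule, the function names, and the 2-D data are all assumptions chosen for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_code(z, codebook):
    """Index of the code closest to z (lowest index wins ties)."""
    return min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))

def quant_error(z, codebook):
    """Squared distance from z to its assigned code."""
    return sq_dist(z, codebook[nearest_code(z, codebook)])

def utilization(latents, codebook):
    """Fraction of codes that receive at least one assignment."""
    used = {nearest_code(z, codebook) for z in latents}
    return len(used) / len(codebook)

def replace_dead_codes(latents, codebook):
    """Move each unassigned code onto the currently worst-quantized latent.

    Hypothetical replacement rule, assumed for this sketch only.
    """
    used = {nearest_code(z, codebook) for z in latents}
    new_codebook = [list(c) for c in codebook]
    for k in range(len(new_codebook)):
        if k not in used:
            worst = max(latents, key=lambda z: quant_error(z, new_codebook))
            new_codebook[k] = list(worst)
    return new_codebook

# Codebook fitted to latents near the origin.
codebook = [[0, 0], [1, 0], [0, 1], [1, 1]]
# Latents after the encoder has drifted toward (5, 5).
latents = [[5, 5], [5, 6], [6, 5], [6, 6]]

before = utilization(latents, codebook)      # only code [1, 1] is assigned
codebook = replace_dead_codes(latents, codebook)
after_one = utilization(latents, codebook)   # the stale code [1, 1] is now dead
codebook = replace_dead_codes(latents, codebook)
after_two = utilization(latents, codebook)

print(before, after_one, after_two)  # 0.25 0.75 1.0
```

In this toy setting the first replacement pass raises utilization from 0.25 to 0.75, but the one code that stayed at its old position then loses all of its assignments, so a second pass is needed to reach 1.0. This mirrors, in miniature, the lag-and-lose-assignments dynamic described above.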