Discrete image tokenization is a key bottleneck for scalable visual generation: a tokenizer must remain compact for efficient latent-space priors while preserving semantic structure and using discrete capacity effectively. Existing quantizers face a trade-off: vector-quantized tokenizers learn flexible geometries but often suffer from biased straight-through optimization, codebook under-utilization, and representation collapse at large vocabularies. Structured scalar or implicit tokenizers ensure stable, near-complete utilization by design, yet rely on fixed discretization geometries that may allocate capacity inefficiently under heterogeneous latent statistics. We introduce Learnable Geometric Quantization (LGQ), a discrete image tokenizer that learns discretization geometry end-to-end. LGQ replaces hard nearest-neighbor lookup with temperature-controlled soft assignments, enabling fully differentiable training while recovering hard assignments at inference. The assignments correspond to posterior responsibilities of an isotropic Gaussian mixture and minimize a variational free-energy objective, provably converging to nearest-neighbor quantization in the low-temperature limit. LGQ combines a token-level peakedness regularizer with a global usage regularizer to encourage confident yet balanced code utilization without imposing rigid grids. Under a controlled VQGAN-style backbone on ImageNet across multiple vocabulary sizes, LGQ achieves stable optimization and balanced utilization. At 16K codebook size, LGQ improves rFID by 11.88% over FSQ while using 49.96% fewer active codes, and improves rFID by 6.06% over SimVQ with 49.45% lower effective representation rate, achieving comparable fidelity with substantially fewer active entries. Our GitHub repository is available at: https://github.com/KurbanIntelligenceLab/LGQ
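The core mechanism the abstract describes, soft assignments given by the posterior responsibilities of an isotropic Gaussian mixture that collapse to nearest-neighbor quantization as the temperature goes to zero, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function name `soft_quantize`, the array shapes, and the choice of NumPy are all assumptions, and the regularizers and end-to-end training are omitted.

```python
import numpy as np

def soft_quantize(z, codebook, tau):
    """Temperature-controlled soft assignment to codebook entries.

    The responsibilities are the posterior of an isotropic Gaussian
    mixture with means at the codebook entries; as tau -> 0 the
    assignment collapses to hard nearest-neighbor quantization.
    Illustrative sketch only (shapes and names are assumptions).
    """
    # Squared distances between latents z (N, D) and codes (K, D).
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K)
    logits = -d2 / tau
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    r = np.exp(logits)
    r /= r.sum(axis=1, keepdims=True)             # responsibilities (N, K)
    return r @ codebook                           # soft-quantized latents

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))
z = rng.normal(size=(8, 4))

soft = soft_quantize(z, codebook, tau=1.0)        # smooth, differentiable
near_hard = soft_quantize(z, codebook, tau=1e-4)  # low-temperature limit

# At low temperature the soft output matches hard nearest-neighbor lookup.
nn = codebook[((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1).argmin(1)]
print(np.allclose(near_hard, nn))
```

Because the responsibilities are a softmax over negative scaled distances, the forward pass is fully differentiable in both the latents and the codebook at any positive temperature, which is what allows training without a straight-through estimator while still recovering hard assignments at inference.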