Vector Quantization (VQ) is a widely used method for converting continuous representations into discrete codes, and it has become fundamental in unsupervised representation learning and latent generative models. However, VQ models are often hindered by representation collapse in the latent space, which leads to low codebook utilization and limits the scalability of the codebook for large-scale training. Existing methods designed to mitigate representation collapse typically reduce the dimensionality of the latent space at the expense of model capacity, an approach that does not fully resolve the core issue. In this study, we conduct a theoretical analysis of representation collapse in VQ models and identify its primary cause as the disjoint optimization of the codebook, in which only a small subset of code vectors is updated through gradient descent. To address this issue, we propose \textbf{SimVQ}, a novel method that reparameterizes the code vectors through a linear transformation layer based on a learnable latent basis. This transformation optimizes the \textit{entire linear space} spanned by the codebook, rather than merely updating \textit{the code vector} selected by the nearest-neighbor search in vanilla VQ models. Although composing two linear maps is commonly understood to be equivalent to applying a single linear layer, our approach works surprisingly well in resolving the collapse issue in VQ models with just one linear layer. We validate the efficacy of SimVQ through extensive experiments across various modalities, including image and audio data, with different model architectures. Our code is available at \url{https://github.com/youngsheen/SimVQ}.
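To make the reparameterization concrete, the following is a minimal NumPy sketch of the idea described above, not the paper's implementation: code vectors are expressed as a latent basis multiplied by a learnable linear layer, so a gradient step on that layer moves the entire spanned codebook rather than only the single vector selected by nearest-neighbor search. All dimensions and variable names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration: K code vectors of dimension d.
K, d = 8, 4

# Latent basis B. In vanilla VQ, B itself would be the codebook and only
# the selected rows would receive gradients.
B = rng.standard_normal((K, d))

# Learnable linear reparameterization W; the effective codebook is C = B @ W.
W = np.eye(d)

def quantize(z, W):
    """Nearest-neighbor lookup in the reparameterized codebook C = B @ W."""
    C = B @ W                                   # (K, d) effective code vectors
    idx = int(np.argmin(((C - z) ** 2).sum(axis=1)))
    return idx, C[idx]

z = rng.standard_normal(d)
idx, c = quantize(z, W)

# Key point: d(loss)/dW involves B as a whole, so a single update to W
# transforms every row of C, i.e. the entire linear space spanned by the
# codebook is optimized, not just the one selected code vector.
```

Since `C = B @ W` is linear in `W`, one gradient step on `W` rotates/scales all `K` effective code vectors simultaneously, which is the mechanism the abstract credits with preventing unused codes from drifting out of reach.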