Vector quantization is a technique in machine learning that discretizes continuous representations into a set of discrete vectors. It is widely employed in tokenizing data representations for large language models, diffusion models, and other generative models. Despite its prevalence, the characteristics and behaviors of vector quantization in generative models remain largely underexplored. In this study, we investigate representation collapse in vector quantization: a critical degradation in which codebook tokens or latent embeddings lose their discriminative power by converging to a limited subset of values. This collapse fundamentally compromises the model's ability to capture diverse data patterns. Leveraging both synthetic and real datasets, we characterize the severity of each type of collapse and identify its triggering conditions. Our analysis reveals that restricted initialization and limited encoder capacity result in token collapse and embedding collapse, respectively. Building on these findings, we propose potential solutions aimed at mitigating each type of collapse. To the best of our knowledge, this is the first comprehensive study examining representation collapse in vector quantization.
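To make the two central notions concrete, the following is a minimal NumPy sketch (not the paper's implementation) of vector quantization by nearest-neighbor codebook lookup, together with a simple codebook-usage statistic: when only a small fraction of codebook entries are ever selected, tokens have collapsed onto a limited subset of values. All names and sizes here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative sizes): a codebook of K entries in d dimensions.
K, d = 64, 8
codebook = rng.normal(size=(K, d))

def quantize(z, codebook):
    """Map each continuous vector in z (N, d) to its nearest codebook entry."""
    # Pairwise squared Euclidean distances between inputs and codes.
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)      # discrete token index per input
    return codebook[idx], idx       # quantized vectors and their token ids

def codebook_usage(idx, K):
    """Fraction of codebook entries actually selected.
    Values near 1/K indicate severe token collapse."""
    return len(np.unique(idx)) / K

z = rng.normal(size=(1024, d))      # continuous encoder outputs (simulated)
zq, idx = quantize(z, codebook)
usage = codebook_usage(idx, K)
```

A healthy codebook on well-spread inputs yields high usage; restricted initialization (e.g., all codes drawn from a narrow region far from the data) would drive `usage` toward a small fraction, which is the token-collapse symptom the abstract describes.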