In this work, we developed and tested three techniques for compressing model weights with vector quantization (VQ). First, to mitigate codebook collapse and enable end-to-end training, we adopted cosine similarity-based assignment. Second, building on the attention-based formulation of Differentiable K-Means (DKM), we combined cosine similarity-based assignment with top-1 sampling and a straight-through estimator, which eliminates the need for weighted-average reconstruction. Third, we investigated differentiable neural architecture search (NAS) to adaptively select layer-wise quantization configurations and thereby further optimize the compression process. Although our method does not consistently outperform existing approaches across all quantization levels, it provides useful insights into the design trade-offs and behavior of VQ-based model compression methods.
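As an illustration of cosine similarity-based assignment with top-1 selection and a straight-through estimator, the following is a minimal NumPy sketch and not our implementation. The function names (`cosine_top1_assign`, `ste_backward`), the array shapes, and the choice to reconstruct with the unnormalized codeword are assumptions made for this example. The backward pass is written by hand so that the straight-through step is explicit.

```python
import numpy as np


def cosine_top1_assign(weights, codebook, eps=1e-8):
    """Assign each weight sub-vector to its most cosine-similar codeword.

    weights:  (n, d) array of weight sub-vectors (hypothetical layout).
    codebook: (k, d) array of codewords.
    Returns the chosen indices (n,) and the reconstruction (n, d).
    """
    # Normalize rows so that the dot product equals cosine similarity.
    w_unit = weights / (np.linalg.norm(weights, axis=1, keepdims=True) + eps)
    c_unit = codebook / (np.linalg.norm(codebook, axis=1, keepdims=True) + eps)
    sim = w_unit @ c_unit.T                 # (n, k) similarity matrix
    idx = np.argmax(sim, axis=1)            # top-1: one codeword per sub-vector
    # Hard reconstruction: no weighted average over codewords.
    # Assumption: the unnormalized codeword is used as the reconstruction.
    quantized = codebook[idx]
    return idx, quantized


def ste_backward(grad_quantized, idx, num_codewords):
    """Hand-written backward pass for the hard assignment above.

    Straight-through estimator: the gradient w.r.t. the weights is the
    gradient w.r.t. the quantized output, passed through unchanged.
    The codebook gradient accumulates the rows routed to each codeword.
    """
    grad_weights = grad_quantized.copy()
    grad_codebook = np.zeros((num_codewords, grad_quantized.shape[1]))
    np.add.at(grad_codebook, idx, grad_quantized)  # scatter-add by index
    return grad_weights, grad_codebook


weights = np.array([[1.0, 0.0], [0.0, 2.0], [-3.0, 0.1]])
codebook = np.array([[2.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
idx, quantized = cosine_top1_assign(weights, codebook)
grad_w, grad_c = ste_backward(np.ones_like(quantized), idx, len(codebook))
```

Because cosine similarity ignores vector magnitude, a sub-vector such as `[-3.0, 0.1]` is matched to the codeword pointing in nearly the same direction, `[-1.0, 0.0]`, regardless of scale. How magnitude is restored after assignment is left open in this sketch.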