Vector quantization (VQ) is a hardware-friendly DNN compression method that can reduce the storage cost and weight-loading data width of hardware accelerators. However, conventional VQ techniques lead to significant accuracy loss because important weights are not well preserved. To tackle this problem, we propose a novel approach called MVQ, which aims to better approximate important weights with a limited number of codewords. At the algorithm level, our approach removes the less important weights through N:M pruning and then minimizes the vector clustering error between the remaining weights and the codewords via a masked k-means algorithm: only the distances between unpruned weights and codewords are computed, and these are then used to update the codewords. At the architecture level, we implement vector quantization on an enhanced weight-stationary (EWS) CNN accelerator and propose a sparse systolic-array design to maximize the benefits brought by masked vector quantization.\\ Our algorithm is validated on various models for image classification, object detection, and segmentation tasks. Experimental results demonstrate that MVQ not only outperforms conventional vector quantization methods at comparable compression ratios but also reduces FLOPs. Under ASIC evaluation, our MVQ accelerator boosts energy efficiency by 2.3$\times$ and reduces the size of the systolic array by 55\% compared with the base EWS accelerator. Compared to previous sparse accelerators, MVQ achieves 1.73$\times$ higher energy efficiency.
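The algorithm-level pipeline described above (N:M pruning followed by masked k-means, where pruned positions are excluded from both the distance computation and the codeword update) can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: the helper names `nm_prune_mask` and `masked_kmeans`, the initialization, and all parameter choices are illustrative, not the paper's actual implementation.

```python
import numpy as np

def nm_prune_mask(W, n=2, m=4):
    """Binary mask keeping the n largest-magnitude weights in every
    group of m consecutive weights (N:M structured sparsity).
    Assumes the flattened weight count is divisible by m."""
    groups = W.reshape(-1, m)
    keep = np.argsort(-np.abs(groups), axis=1)[:, :n]  # indices of the n largest |w|
    mask = np.zeros_like(groups)
    np.put_along_axis(mask, keep, 1.0, axis=1)
    return mask.reshape(W.shape)

def masked_kmeans(W, M, k, iters=20, seed=0):
    """Cluster weight vectors W (rows) into k codewords, ignoring pruned
    positions indicated by binary mask M (same shape, 1 = kept).
    Only masked distances ||m * (w - c)||^2 drive assignment, and the
    codeword update averages only the unpruned entries of each cluster."""
    rng = np.random.default_rng(seed)
    C = W[rng.choice(len(W), size=k, replace=False)].copy()  # init from data rows
    for _ in range(iters):
        # (num_vectors, k) masked squared distances
        d2 = (((W[:, None, :] - C[None, :, :]) ** 2) * M[:, None, :]).sum(-1)
        assign = d2.argmin(1)
        for j in range(k):
            sel = assign == j
            if not sel.any():
                continue  # keep an empty cluster's codeword unchanged
            num = (W[sel] * M[sel]).sum(0)          # sum of unpruned weights
            den = M[sel].sum(0)                     # count of unpruned weights
            C[j] = np.where(den > 0, num / np.maximum(den, 1), C[j])
    return C, assign
```

After clustering, each weight vector is stored as a codeword index plus its N:M mask, which is what the sparse systolic array exploits at the architecture level.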