Modern speaker verification (SV) systems typically demand expensive storage and computing resources, hindering their deployment on mobile devices. In this paper, we explore adaptive neural network quantization for lightweight speaker verification. First, we propose a novel adaptive uniform precision quantization method that dynamically generates quantization centroids customized for each network layer via k-means clustering. Applying it to pre-trained SV systems yields a series of quantized variants with different bit widths. To enhance the performance of low-bit quantized models, we further introduce a mixed precision quantization algorithm together with a multi-stage fine-tuning (MSFT) strategy. Unlike uniform precision quantization, the mixed precision approach assigns varying bit widths to different network layers. Once the bit combination is determined, MSFT progressively quantizes and fine-tunes the network in a specific order. Finally, we design two distinct binary quantization schemes, a static and an adaptive quantizer, to mitigate the performance degradation of 1-bit quantized models. Experiments on VoxCeleb demonstrate that lossless 4-bit uniform precision quantization is achieved on both ResNets and DF-ResNets, yielding a promising compression ratio of around 8. Moreover, compared with the uniform precision approach, mixed precision quantization not only obtains additional performance gains at a similar model size but also offers the flexibility to generate a bit combination for any desired model size. In addition, our proposed 1-bit quantization schemes remarkably boost the performance of binarized models. Lastly, a thorough comparison with existing lightweight SV systems reveals that our models outperform all previous methods by a large margin across various model size ranges.
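To make the core idea concrete, the following is a minimal sketch of per-layer adaptive quantization via 1-D k-means, in the spirit of the method described above. The function names (`kmeans_1d`, `quantize_layer`) and all implementation details (quantile initialization, Lloyd iterations, NumPy-only code) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def kmeans_1d(values, k, iters=20):
    """Lloyd's k-means on a flat weight vector; returns k centroids.
    Centroids are initialized from quantiles, a common stable choice
    for 1-D data (an assumption, not necessarily the paper's scheme)."""
    centroids = np.quantile(values, np.linspace(0.0, 1.0, k))
    for _ in range(iters):
        # assign each weight to its nearest centroid
        assign = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        # move each centroid to the mean of its assigned weights
        for j in range(k):
            members = values[assign == j]
            if members.size:
                centroids[j] = members.mean()
    return centroids

def quantize_layer(weights, bits):
    """Quantize one layer's weights to 2**bits centroids learned
    from that layer's own weight distribution."""
    flat = weights.ravel()
    centroids = kmeans_1d(flat, 2 ** bits)
    # snap every weight to its nearest learned centroid
    idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
    return centroids[idx].reshape(weights.shape), centroids
```

Because the centroids are re-clustered per layer, each layer's codebook adapts to that layer's weight distribution; sweeping `bits` over 8, 4, 2, 1 would produce the family of quantized variants the abstract refers to.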