Although low-bit quantization provides a practical means of deploying speaker verification on resource-constrained devices, its effects on verification performance remain poorly understood. In this paper, we study uniform K-means quantization-aware training of ResNet-36 and ResNet-200 through joint layer-wise and score-level analyses. Our layer-wise analysis highlights fragile components and shows that score degradation is not fully explained by weight distortion. We identify a clear knee point at 2 bits, where score drift grows and harmful decision flips concentrate near the FP32 threshold. Our score-level analysis reveals where and how score errors emerge under extreme quantization. Building on these findings, we propose a calibrated multi-precision cascade that resolves most trials at 2 bits and escalates only ambiguous cases, achieving performance close to FP32 at substantially lower compute and memory cost.
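The cascade idea can be illustrated with a minimal sketch. The abstract does not specify how ambiguity is detected, so the rule below is an assumption: a trial is treated as ambiguous when its low-bit score falls within a margin of that stage's calibrated threshold, and only then is it passed to the next, higher-precision scorer. The function name `cascade_decide`, the `(score_fn, threshold, margin)` stage layout, and the toy scores are hypothetical and are not taken from the paper.

```python
from typing import Callable, Sequence, Tuple

# One cascade stage: (scoring function, calibrated threshold, ambiguity margin).
# All three are illustrative assumptions, not the paper's actual configuration.
Stage = Tuple[Callable[[dict], float], float, float]


def cascade_decide(trial: dict, stages: Sequence[Stage]) -> Tuple[bool, int]:
    """Return (accept, stage_index) for one verification trial.

    Stages are ordered from cheapest (e.g. 2-bit) to most precise (e.g. FP32).
    A stage decides the trial if its score lies at least `margin` away from
    its threshold; otherwise the trial escalates. The last stage always decides.
    """
    if not stages:
        raise ValueError("stages must not be empty")
    last = len(stages) - 1
    for index, (score_fn, threshold, margin) in enumerate(stages):
        score = score_fn(trial)
        if index == last or abs(score - threshold) >= margin:
            return score >= threshold, index
    raise AssertionError("unreachable: the last stage always decides")


# Toy stand-ins for a 2-bit model and an FP32 model: each trial carries
# precomputed scores so the example runs without any trained network.
stages = [
    (lambda t: t["score_2bit"], 0.5, 0.1),  # cheap stage, escalates near 0.5
    (lambda t: t["score_fp32"], 0.5, 0.0),  # final stage, always decides
]

confident = {"score_2bit": 0.9, "score_fp32": 0.8}
ambiguous = {"score_2bit": 0.55, "score_fp32": 0.4}

print(cascade_decide(confident, stages))  # decided by the 2-bit stage
print(cascade_decide(ambiguous, stages))  # escalated to the FP32 stage
```

Under this assumed rule, the margin controls the trade-off: a wider margin escalates more trials and moves decisions closer to FP32, while a narrower margin keeps more trials at 2 bits and lowers average cost.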