End-to-end automatic speech recognition (ASR) models have seen substantial quality gains with the recent development of large-scale universal speech models (USMs). However, deploying these massive USMs is extremely expensive because of their memory usage and computational cost. Model compression is therefore an important research topic for fitting USM-based ASR within budget in real-world scenarios. In this study, we propose a USM fine-tuning approach for ASR that is aware of both low-bit quantization and N:M structured sparsity on the model weights, reducing model complexity from the perspectives of parameter precision and matrix topology. We conduct extensive experiments with a 2-billion-parameter USM on a large-scale voice search dataset to evaluate the proposed method. A series of ablation studies validates the effectiveness of quantization down to int4 and of 2:4 sparsity. However, a single compression technique fails to recover performance under extreme setups such as int2 quantization and 1:4 sparsity. By contrast, our proposed method compresses the model to 9.4% of its original size at the cost of only a 7.3% relative word error rate (WER) regression. We also provide in-depth analyses of the results and discuss the limitations and potential solutions, which may be valuable for future studies.
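To make the two compression axes concrete, the following is a minimal NumPy sketch of how an N:M sparsity mask and low-bit quantization might be applied to a single weight matrix. It is an illustration under stated assumptions, not the method used in this study: the magnitude-based pruning criterion, the symmetric per-row scaling, and the function names `nm_sparsity_mask`, `fake_quantize`, and `compress` are all hypothetical choices made for this example.

```python
import numpy as np


def nm_sparsity_mask(w, n=2, m=4):
    """Boolean mask keeping the n largest-magnitude weights in each group of
    m consecutive weights along the last axis (assumed pruning criterion)."""
    rows, cols = w.shape
    assert cols % m == 0, "number of columns must be divisible by m"
    groups = np.abs(w).reshape(rows, cols // m, m)
    # Sort each group by descending magnitude and keep the first n indices.
    order = np.argsort(-groups, axis=-1, kind="stable")
    keep = order[..., :n]
    mask = np.zeros(groups.shape, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=-1)
    return mask.reshape(rows, cols)


def fake_quantize(w, bits=4):
    """Symmetric per-row quantization to signed `bits`-bit integers.
    Returns the dequantized weights, the integer codes, and the scales."""
    assert bits >= 2, "this sketch assumes at least 2 bits"
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w), axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero on all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale, q.astype(np.int8), scale


def compress(w, n=2, m=4, bits=4):
    """Apply the N:M mask first, then quantize the surviving weights."""
    mask = nm_sparsity_mask(w, n, m)
    w_hat, q, scale = fake_quantize(w * mask, bits)
    return w_hat, q, mask, scale


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((4, 8)).astype(np.float32)
    w_hat, q, mask, scale = compress(w, n=2, m=4, bits=4)
    print("kept per group of 4:", mask.reshape(4, 2, 4).sum(axis=-1))
    print("integer codes:\n", q)
```

In a compression-aware fine-tuning loop, a forward pass would typically use the masked, dequantized weights `w_hat` while gradients update the full-precision weights, commonly via a straight-through estimator. Whether the study uses that exact scheme is not stated in this abstract, so the sketch should be read only as a description of what int4 quantization and 2:4 sparsity mean at the level of a weight matrix.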