End-to-end automatic speech recognition (ASR) models have seen revolutionary quality gains with the recent development of large-scale universal speech models (USMs). However, deploying these massive USMs is extremely expensive because of their enormous memory usage and computational cost. Model compression is therefore an important research topic for fitting USM-based ASR within budget in real-world scenarios. In this study, we propose a USM fine-tuning approach for ASR that is aware of both low-bit quantization and N:M structured sparsity on the model weights, reducing model complexity from the perspectives of parameter precision and matrix topology. We conduct extensive experiments with a 2-billion-parameter USM on a large-scale voice search dataset to evaluate the proposed method. A series of ablation studies validates the effectiveness of quantization as low as int4 and of 2:4 sparsity. However, a single compression technique on its own fails to recover performance under extreme setups, including int2 quantization and 1:4 sparsity. By contrast, our proposed method compresses the model to 9.4% of its original size at the cost of only a 7.3% relative word error rate (WER) regression. We also provide in-depth analyses of the results and discuss the limitations and potential solutions, which should be valuable for future studies.
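To make the two compression axes concrete, the following is a minimal NumPy sketch of an N:M sparsity mask combined with a low-bit quantize-dequantize step on a weight matrix. It is an illustration under stated assumptions, not the method evaluated in this study: the function names `nm_sparsity_mask` and `fake_quantize` are hypothetical, the magnitude-based selection of which weights to keep and the symmetric per-row scaling are assumed choices, and the fine-tuning loop that would make the model aware of these transforms is omitted.

```python
import numpy as np


def nm_sparsity_mask(w, n=2, m=4):
    """Boolean mask keeping the n largest-magnitude weights in every group
    of m consecutive weights along the last axis (assumed selection rule)."""
    assert w.shape[-1] % m == 0, "last axis must be divisible by m"
    groups = np.abs(w).reshape(-1, m)
    # Indices of the n largest magnitudes inside each group of m.
    keep = np.argsort(-groups, axis=1, kind="stable")[:, :n]
    mask = np.zeros(groups.shape, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return mask.reshape(w.shape)


def fake_quantize(w, bits=4):
    """Symmetric per-row quantize-dequantize (assumed scheme).
    Returns the dequantized weights and the integer codes."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for int4, 1 for int2
    scale = np.max(np.abs(w), axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)   # avoid division by zero
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale, q.astype(np.int8)


rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16))

mask = nm_sparsity_mask(w, n=2, m=4)           # 2:4 sparsity pattern
w_sparse = w * mask                            # pruned weights become zero
w_deq, codes = fake_quantize(w_sparse, bits=4)

print("nonzeros per group of 4:", mask.reshape(-1, 4).sum(axis=1)[:4])
print("int4 code range:", codes.min(), codes.max())
```

Applying the mask before quantization means pruned weights map to the integer code 0 and remain exactly zero after dequantization. In a training setting, a transform like this would typically be applied to the weights in the forward pass while gradients update the underlying full-precision weights; that detail is likewise an assumption here rather than something stated in the abstract.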