End-to-end automatic speech recognition (ASR) models have seen revolutionary quality gains with the recent development of large-scale universal speech models (USMs). However, deploying these massive USMs is extremely expensive because of their enormous memory usage and computational cost. Model compression is therefore an important research topic for fitting USM-based ASR within budget in real-world scenarios. In this study, we propose a USM fine-tuning approach for ASR that is aware of low-bit quantization and N:M structured sparsity on the model weights, reducing model complexity in terms of both parameter precision and matrix topology. We conduct extensive experiments with a 2-billion-parameter USM on a large-scale voice search dataset to evaluate the proposed method. A series of ablation studies validates the effectiveness of compression up to int4 quantization and 2:4 sparsity. However, a single compression technique fails to recover performance under extreme setups such as int2 quantization and 1:4 sparsity. By contrast, our proposed method compresses the model to 9.4% of its original size at the cost of only a 7.3% relative word error rate (WER) regression. We also provide in-depth analyses of the results and discuss the limitations and potential solutions, which may be valuable for future studies.
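To make the two compression axes named in the abstract concrete, the following is a minimal sketch of N:M structured sparsity and low-bit quantization applied to a single row of weights. It is an illustration under stated assumptions, not the authors' method: the abstract does not say how weights are selected for pruning or how the quantization scale is chosen, so magnitude-based selection and a single symmetric scale per row are assumed here. The function names `nm_sparsify`, `quantize_symmetric`, and `dequantize` are hypothetical, and the sketch shows only the weight transformation, not the fine-tuning procedure.

```python
def nm_sparsify(row, n=2, m=4):
    """Keep the n largest-magnitude weights in each group of m; zero the rest.

    Assumption: magnitude-based selection (not specified in the abstract).
    """
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    out = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Indices of the n largest-magnitude entries in this group.
        keep = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n]
        out.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return out


def quantize_symmetric(row, bits=4):
    """Map floats to signed integers in [-qmax, qmax] using one scale per row.

    Assumption: symmetric, per-row scaling; for int4 this uses [-7, 7]
    and leaves the value -8 unused.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from integers and the scale."""
    return [v * scale for v in q]


# Toy row of 8 weights: two groups of 4 for 2:4 sparsity.
weights = [0.7, -0.1, 0.05, -0.4, 0.2, 0.0, -0.6, 0.3]
sparse = nm_sparsify(weights, n=2, m=4)
q, scale = quantize_symmetric(sparse, bits=4)
approx = dequantize(q, scale)
print(sparse)
print(q)
```

Calling `nm_sparsify(weights, n=1, m=4)` or `quantize_symmetric(sparse, bits=2)` gives the more extreme 1:4 and int2 settings mentioned in the abstract; with `bits=2` the symmetric range shrinks to {-1, 0, 1}, which gives a rough sense of why such setups are hard to recover from.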