Driven by rapid advances in computing hardware and the dramatic growth of available data, pre-trained speech recognition models such as Whisper have significantly improved performance on speech recognition tasks. However, these models usually carry a high computational overhead, making them difficult to run efficiently on resource-constrained devices. To speed up inference and reduce model size while maintaining performance, we propose a novel guided knowledge distillation and quantization method for the large pre-trained model Whisper. The student model selects distillation and quantization layers based on quantization loss and distillation loss, respectively. We compress $\text{Whisper}_\text{small}$ to the $\text{Whisper}_\text{base}$ and $\text{Whisper}_\text{tiny}$ levels, making $\text{Whisper}_\text{small}$ 5.18x and 10.48x smaller, respectively. Moreover, compared to the original $\text{Whisper}_\text{base}$ and $\text{Whisper}_\text{tiny}$, the new compressed models achieve relative character error rate (CER) reductions of 11.3% and 14.0%, respectively.
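The two headline figures above, a size reduction factor and a relative CER reduction, follow from simple ratios. The sketch below shows how such figures are typically computed; the function names and the numeric inputs are hypothetical illustrations and are not taken from the paper.

```python
def compression_ratio(original_size: float, compressed_size: float) -> float:
    """How many times smaller the compressed model is than the original."""
    if compressed_size <= 0:
        raise ValueError("compressed_size must be positive")
    return original_size / compressed_size


def relative_reduction(baseline_cer: float, new_cer: float) -> float:
    """Relative CER reduction of the new model versus a baseline, as a fraction."""
    if baseline_cer <= 0:
        raise ValueError("baseline_cer must be positive")
    return (baseline_cer - new_cer) / baseline_cer


if __name__ == "__main__":
    # Hypothetical values for illustration only; not results from the paper.
    print(f"{compression_ratio(100.0, 20.0):.2f}x smaller")
    print(f"{relative_reduction(0.20, 0.18):.1%} relative CER reduction")
```

A relative reduction is measured against the baseline's own error rate, so the same absolute CER drop counts for more when the baseline CER is lower.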