As the size of pre-trained speech recognition models increases, running these large models in low-latency or resource-constrained environments becomes challenging. In this work, we leverage pseudo-labelling to assemble a large-scale open-source dataset which we use to distill the Whisper model into a smaller variant, called Distil-Whisper. Using a simple word error rate (WER) heuristic, we select only the highest quality pseudo-labels for training. The distilled model is 5.8 times faster with 51% fewer parameters, while performing to within 1% WER on out-of-distribution test data in a zero-shot transfer setting. Distil-Whisper maintains the robustness of the Whisper model to difficult acoustic conditions, while being less prone to hallucination errors on long-form audio. Distil-Whisper is designed to be paired with Whisper for speculative decoding, yielding a 2 times speed-up while mathematically ensuring the same outputs as the original model. To facilitate further research in this domain, we make our training code, inference code and models publicly accessible.
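The WER heuristic for selecting pseudo-labels can be illustrated with a minimal sketch. It assumes each training example carries both a ground-truth transcript and a Whisper-generated pseudo-label, and that an example is kept only when the WER between the two falls at or below a threshold. The function names, the dictionary keys `reference` and `pseudo_label`, the simplified text normaliser, and the default threshold of 0.10 are all illustrative assumptions, not details stated in the abstract.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split into words.

    A deliberately simplified normaliser (assumption); a real pipeline
    would likely use a more thorough one.
    """
    return re.sub(r"[^\w\s']", " ", text.lower()).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        # Convention chosen for this sketch: an empty reference matches
        # only an empty hypothesis.
        return 0.0 if not hyp else 1.0

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def filter_pseudo_labels(examples: list[dict], max_wer: float = 0.10) -> list[dict]:
    """Keep examples whose pseudo-label is close to the ground truth.

    `max_wer` is a hypothetical threshold chosen for illustration.
    """
    return [
        ex
        for ex in examples
        if word_error_rate(ex["reference"], ex["pseudo_label"]) <= max_wer
    ]


if __name__ == "__main__":
    examples = [
        {"reference": "hello world", "pseudo_label": "Hello, world!"},
        {"reference": "the cat sat on the mat", "pseudo_label": "the cat on a mat"},
    ]
    kept = filter_pseudo_labels(examples)
    print(len(kept))  # the second example exceeds the threshold and is dropped
```

Filtering on agreement with the ground truth discards examples where the teacher's transcription is likely wrong (for instance, hallucinated or truncated output), so the student is trained only on pseudo-labels of higher estimated quality.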