Pre-training with self-supervised models, such as Hidden-unit BERT (HuBERT) and wav2vec 2.0, has brought significant improvements in automatic speech recognition (ASR). However, these models usually incur a high computational cost to achieve their strong performance, which slows down inference. To improve model efficiency, we introduce an early exit scheme for ASR, named HuBERT-EE, that allows the model to stop inference dynamically. In HuBERT-EE, multiple early exit branches are added at intermediate layers. When the intermediate prediction of an early exit branch is confident, the model stops inference and returns the corresponding result early. We investigate a suitable early exiting criterion and fine-tuning strategy for performing early exiting effectively. Experimental results on LibriSpeech show that HuBERT-EE can accelerate HuBERT inference while balancing the trade-off between performance and latency.
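The early exit loop described above can be illustrated with a minimal sketch. It is not the HuBERT-EE implementation: the abstract does not specify the exit criterion, so the sketch assumes one plausible choice, namely exiting when the mean per-frame entropy of a branch's softmax posteriors falls below a threshold. The names `early_exit_forward`, `mean_entropy`, `branches`, and `threshold` are hypothetical, and the layers and branches are toy functions standing in for Transformer layers and output heads.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Frames = List[List[float]]  # one vector of values per frame


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a single frame."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_entropy(frame_logits: Frames) -> float:
    """Average per-frame entropy (in nats) of the softmax posteriors.

    Assumed confidence measure: low entropy means a confident prediction.
    """
    total = 0.0
    for logits in frame_logits:
        probs = softmax(logits)
        total -= sum(p * math.log(p) for p in probs if p > 0.0)
    return total / len(frame_logits)


def early_exit_forward(
    features: Frames,
    layers: Sequence[Callable[[Frames], Frames]],
    branches: Dict[int, Callable[[Frames], Frames]],
    threshold: float,
) -> Tuple[Frames, int]:
    """Run layers in order and stop at the first confident branch.

    `branches` maps a layer index to its (hypothetical) exit head.
    Returns the branch logits and the index of the layer that produced them.
    """
    hidden = features
    last = len(layers) - 1
    for i, layer in enumerate(layers):
        hidden = layer(hidden)
        if i not in branches:
            continue  # no exit branch attached to this layer
        logits = branches[i](hidden)
        # The final layer always returns; earlier ones only when confident.
        if i == last or mean_entropy(logits) < threshold:
            return logits, i
    raise ValueError("the final layer needs an output branch")


if __name__ == "__main__":
    # Toy stand-ins: each "layer" sharpens the scores, each branch is identity.
    def sharpen(hidden: Frames) -> Frames:
        return [[2.0 * x for x in frame] for frame in hidden]

    toy_layers = [sharpen, sharpen, sharpen]
    toy_branches = {0: lambda h: h, 1: lambda h: h, 2: lambda h: h}
    toy_input = [[1.0, 0.0], [0.0, 1.0]]

    _, exit_layer = early_exit_forward(toy_input, toy_layers, toy_branches, 0.2)
    print("exited at layer", exit_layer)
```

In this sketch the threshold is the knob that trades latency against accuracy: a lower value makes the model run more layers before it is confident enough to exit, while a higher value lets it return from a shallower branch.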