Recent progress in Automatic Speech Recognition (ASR) has come with a substantial increase in model size: models may now contain billions of parameters, which makes inference slow even on suitable hardware. In this context, ASR models are available in several sizes, with different inference costs and correspondingly different performance levels. Based on the observation that smaller models already perform optimally on large parts of test corpora, we propose to train a decision module that, given an audio sample, selects the smallest model sufficient to produce a good transcription. We apply our approach to two Whisper models of different sizes. By keeping the decision process computationally efficient, we obtain a decision module that yields substantial computational savings with only a limited drop in performance.
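The routing idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Router` class, the `decide` scoring function, the 0.5 threshold, and the toy stand-ins for the two models are all hypothetical names and values chosen for illustration. The sketch only assumes that a cheap decision function estimates whether the smaller model is sufficient for a given audio sample.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

Audio = Sequence[float]  # placeholder type for a raw waveform


@dataclass
class Router:
    """Hypothetical router that picks the smallest sufficient ASR model."""
    decide: Callable[[Audio], float]  # assumed: score in [0, 1] that the small model suffices
    small: Callable[[Audio], str]     # cheap model (e.g. a small Whisper variant)
    large: Callable[[Audio], str]     # expensive model (e.g. a larger Whisper variant)
    threshold: float = 0.5            # illustrative value, not taken from the paper

    def transcribe(self, audio: Audio) -> Tuple[str, str]:
        """Return (name of the model used, transcript)."""
        if self.decide(audio) >= self.threshold:
            return "small", self.small(audio)
        return "large", self.large(audio)


# Toy stand-ins so the sketch runs without any ASR library.
def toy_decide(audio: Audio) -> float:
    # Purely illustrative rule: treat short clips as "easy".
    return 1.0 if len(audio) < 16000 else 0.0


router = Router(
    decide=toy_decide,
    small=lambda audio: "transcript from small model",
    large=lambda audio: "transcript from large model",
)

short_clip = [0.0] * 8000
long_clip = [0.0] * 32000
print(router.transcribe(short_clip)[0])
print(router.transcribe(long_clip)[0])
```

For any savings to materialize, the decision function has to cost much less than the large model, which is why the abstract stresses keeping the decision process computationally efficient. The threshold trades compute against accuracy: raising it sends more samples to the large model.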