This paper presents an overview and evaluation of some of the end-to-end ASR models on long-form audios. We study three categories of Automatic Speech Recognition(ASR) models based on their core architecture: (1) convolutional, (2) convolutional with squeeze-and-excitation and (3) convolutional models with attention. We selected one ASR model from each category and evaluated Word Error Rate, maximum audio length and real-time factor for each model on a variety of long audio benchmarks: Earnings-21 and 22, CORAAL, and TED-LIUM3. The model from the category of self-attention with local attention and global token has the best accuracy comparing to other architectures. We also compared models with CTC and RNNT decoders and showed that CTC-based models are more robust and efficient than RNNT on long form audio.
翻译:本文概述并评估了若干端到端自动语音识别(ASR)模型在长语音上的表现。我们根据核心架构将ASR模型分为三类进行系统性研究:(1)卷积模型,(2)带挤压激励的卷积模型,以及(3)带有注意力机制的卷积模型。从每类中选择一个代表性ASR模型,在多个长语音基准数据集(Earnings-21、Earnings-22、CORAAL、TED-LIUM3)上评估其词错误率、最大音频长度及实时因子。采用局部注意力与全局token的自注意力模型在准确率上优于其他架构。同时比较了CTC解码器与RNNT解码器的性能差异,结果表明基于CTC的模型在长语音转录任务中比RNNT更具鲁棒性和效率。