Current research on automatic speech recognition (ASR) is marked by a clear division between end-to-end approaches and classic modular systems. Although high-level comparisons of the two approaches in terms of their requirements and (dis)advantages are common, a closer comparison under similar conditions is not readily available in the literature. In this work, we present a comparison focused on label topology and training criterion. We compare two discriminative alignment models, one with a hidden Markov model (HMM) topology and one with a connectionist temporal classification (CTC) topology, and two first-order label context ASR models, utilizing a factored HMM and a strictly monotonic recurrent neural network transducer, respectively. We use different measurements to evaluate alignment quality, and compare the word error rate and real-time factor of our best systems. Experiments are conducted on the LibriSpeech 960h and Switchboard 300h tasks.