Non-autoregressive (NAR) automatic speech recognition (ASR) models predict all tokens independently and simultaneously, which yields high inference speed. However, there is still a gap in accuracy between NAR models and autoregressive (AR) models. In this paper, we propose EffectiveASR, a single-step NAR ASR architecture that offers both high accuracy and high inference speed. It uses an Index Mapping Vector (IMV) based alignment generator to produce alignments during training, and an alignment predictor that learns these alignments for use at inference. The model can be trained end-to-end (E2E) with a cross-entropy loss combined with an alignment loss. EffectiveASR achieves results competitive with the leading models on the AISHELL-1 and AISHELL-2 Mandarin benchmarks. Specifically, it achieves character error rates (CER) of 4.26%/4.62% on the AISHELL-1 dev/test sets, outperforming the AR Conformer while delivering about a 30x inference speedup.
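For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a character error rate is commonly computed: the character-level Levenshtein distance between hypothesis and reference, summed over the corpus and divided by the total number of reference characters. The function names `edit_distance` and `cer` and the example sentences are illustrative assumptions, not taken from the paper, and the paper's own scoring pipeline (for example, its text normalization) may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitution, insertion, deletion cost 1)."""
    # prev[j] holds the distance between the first i-1 items of ref and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from the first i items of ref to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(refs, hyps):
    """Corpus-level CER: total character edits divided by total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    if total_chars == 0:
        raise ValueError("reference text is empty")
    return total_edits / total_chars


# Hypothetical example: one substituted character out of ten reference characters.
refs = ["今天天气很好", "语音识别"]
hyps = ["今天天气很号", "语音识别"]
print(f"{cer(refs, hyps):.2%}")  # prints 10.00%
```

Aggregating edits over the whole corpus before dividing, rather than averaging per-utterance rates, keeps short utterances from dominating the score; this choice is an assumption of the sketch.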