Non-autoregressive (NAR) automatic speech recognition (ASR) models predict all tokens independently and in parallel, which makes inference fast. However, their accuracy still lags behind that of autoregressive (AR) models. To narrow this gap, we propose EfficientASR, a single-step NAR ASR architecture that combines high accuracy with fast inference. It uses an Index Mapping Vector (IMV) based alignment generator to generate alignments during training, and an alignment predictor that learns these alignments for use at inference. The model can be trained end-to-end (E2E) with a cross-entropy loss combined with an alignment loss. EfficientASR achieves competitive results on the AISHELL-1 and AISHELL-2 benchmarks compared to state-of-the-art (SOTA) models. Specifically, it achieves character error rates (CER) of 4.26%/4.62% on the AISHELL-1 dev/test sets, outperforming the SOTA AR Conformer while running about 30x faster at inference.
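For readers unfamiliar with the CER metric reported above, the following is a minimal sketch of how a corpus-level character error rate is commonly computed: the total Levenshtein (edit) distance between hypotheses and references, divided by the total number of reference characters. The abstract does not say which scoring tool was used, so this standard definition is an assumption, and the function names `edit_distance` and `cer` are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Counts the minimum number of substitutions, insertions, and
    deletions needed to turn `ref` into `hyp`.
    """
    # prev[j] holds the distance between the first i-1 items of ref
    # and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of r
                curr[j - 1] + 1,     # insertion of h
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(refs, hyps):
    """Corpus-level CER (assumed definition): total edits / total reference characters."""
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_errors / total_chars


if __name__ == "__main__":
    # Illustrative strings only, not taken from AISHELL-1.
    refs = ["今天天气很好", "语音识别"]
    hyps = ["今天天汽很好", "语音识别"]
    print(f"CER = {cer(refs, hyps):.2%}")
```

In this illustrative pair there is one substituted character out of ten reference characters, so the sketch reports a CER of 10%. Under the same definition, a CER of 4.62% corresponds to roughly 46 character errors per 1,000 reference characters.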