Attention-based encoder-decoder models, e.g., the Transformer and its variants, generate the output sequence in an autoregressive (AR) manner. Despite their superior performance, AR models are computationally inefficient, as generation requires as many iterations as the output length. In this paper, we propose Paraformer-v2, an improved version of Paraformer, for fast, accurate, and noise-robust non-autoregressive speech recognition. In Paraformer-v2, we use a CTC module to extract the token embeddings, as an alternative to the continuous integrate-and-fire module in Paraformer. Extensive experiments demonstrate that Paraformer-v2 outperforms Paraformer on multiple datasets, especially on English datasets (over 14% improvement in WER), and is more robust in noisy environments.
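To make the CTC-based token-embedding idea concrete, here is a minimal illustrative sketch of one plausible extraction scheme: greedily decode the CTC posteriors, then pool the encoder frames aligned to each emitted token into a single embedding. The function name `ctc_token_embeddings`, the blank index, and mean-pooling are assumptions for illustration only; the abstract does not specify Paraformer-v2's exact mechanism.

```python
import numpy as np

def ctc_token_embeddings(encoder_out, ctc_logits, blank=0):
    """Illustrative sketch: greedy CTC alignment -> token embeddings.

    encoder_out: (T, D) encoder hidden states.
    ctc_logits:  (T, V) CTC output logits over the vocabulary.
    Returns (tokens, embeddings) where embeddings is (N, D): each token's
    embedding is the mean of the encoder frames greedily aligned to it.
    """
    ids = ctc_logits.argmax(axis=-1)  # per-frame greedy labels
    tokens, embs = [], []
    prev, frames = blank, []
    for t, i in enumerate(ids):
        if i != prev and frames:
            # label changed: flush the frames of the previous token
            embs.append(encoder_out[frames].mean(axis=0))
            frames = []
        if i != blank:
            if i != prev:
                tokens.append(int(i))  # new (non-repeated) token
            frames.append(t)
        prev = i
    if frames:  # flush the final token, if any
        embs.append(encoder_out[frames].mean(axis=0))
    emb_mat = np.stack(embs) if embs else np.zeros((0, encoder_out.shape[1]))
    return tokens, emb_mat
```

Because the number of extracted embeddings equals the number of decoded tokens, such a module could predict the output length in one pass, which is the property a non-autoregressive decoder needs.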