This paper proposes a novel non-autoregressive (NAR) block-based Attention Mask Decoder (AMD) that flexibly balances the performance-efficiency trade-off for Conformer ASR systems. AMD performs parallel NAR inference within contiguous blocks of output labels that are concealed using attention masks, while conducting left-to-right AR prediction and history context amalgamation between blocks. A beam search algorithm is designed to leverage a dynamic fusion of CTC, AR decoder, and AMD probabilities. Experiments on the LibriSpeech-100hr corpus show that the tripartite decoder incorporating the AMD module achieves a maximum decoding speed-up ratio of 1.73x over the baseline CTC+AR decoding, while incurring no statistically significant word error rate (WER) increase on the test sets. When operating at the same decoding real-time factors, statistically significant WER reductions of up to 0.7% and 0.3% absolute (5.3% and 6.1% relative) were obtained over the CTC+AR baseline.
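To make the block-wise decoding scheme concrete, the sketch below builds an illustrative attention mask of the kind described above: label positions inside the current block are concealed from one another (enabling parallel NAR prediction within the block), while every position retains full visibility of all completed history blocks (preserving left-to-right AR context between blocks). This is a minimal assumed construction for illustration, not the authors' exact implementation; the function name and `True`-means-attend convention are our own.

```python
import numpy as np

def amd_attention_mask(seq_len: int, block_size: int) -> np.ndarray:
    """Illustrative AMD-style mask (assumption, not the paper's code).

    True  = position may attend to that label slot.
    False = label slot is concealed.
    """
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        block_start = (i // block_size) * block_size
        mask[i, :block_start] = True  # full history-block context (AR between blocks)
        mask[i, i] = True             # own query slot remains visible
        # positions block_start..block_end other than i stay False:
        # sibling labels in the current block are concealed, so they
        # can be predicted in parallel (NAR within the block)
    return mask

# e.g. with seq_len=6, block_size=3, position 4 sees labels 0-2 (history)
# and slot 4, but not its siblings 3 and 5 in the current block
```

Larger block sizes conceal more labels per parallel step, which is the lever behind the performance-efficiency trade-off the paper explores.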