Current ASR systems are mainly trained and evaluated at the utterance level, although long-range cross-utterance context can also be incorporated. A key task is to derive a suitable compact representation of the most relevant history contexts. In contrast to previous research based on either LSTM-RNN encoded histories, which attenuate the information from longer-range contexts, or frame-level concatenation of Transformer context embeddings, this paper learns compact low-dimensional cross-utterance contextual features in the Conformer-Transducer encoder using specially designed attention pooling layers applied over efficiently cached history vectors of the preceding utterances. Experiments on the 1000-hr Gigaspeech corpus demonstrate that the proposed contextualized streaming Conformer-Transducers outperform the baseline using utterance-internal context only, with statistically significant WER reductions of 0.7% to 0.5% absolute (4.3% to 3.1% relative) on the dev and test data.
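The idea of attention pooling over cached history vectors can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the layer design, so the scaled dot-product scoring, the fixed query vectors (learned in a real model), and the names `attention_pool` and `HistoryCache` are all assumptions made for illustration. The sketch compresses each preceding utterance into a fixed number of vectors and keeps only the most recent utterances in a bounded cache.

```python
import math
from collections import deque


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(history, queries):
    """Compress T history vectors into len(queries) context vectors.

    history: list of T vectors, each a list of D floats.
    queries: list of K query vectors of size D. In a real model these
             would be learned parameters; here they are fixed inputs
             (an assumption of this sketch).
    """
    dim = len(history[0])
    scale = math.sqrt(dim)
    pooled = []
    for q in queries:
        # Scaled dot-product score of this query against every history vector.
        scores = [sum(qi * hi for qi, hi in zip(q, h)) / scale for h in history]
        weights = softmax(scores)
        # Weighted sum of history vectors gives one pooled context vector.
        pooled.append([
            sum(w * h[d] for w, h in zip(weights, history))
            for d in range(dim)
        ])
    return pooled


class HistoryCache:
    """Hypothetical cache of pooled vectors for the last `max_utts` utterances."""

    def __init__(self, max_utts):
        # deque with maxlen silently drops the oldest utterance when full.
        self._utts = deque(maxlen=max_utts)

    def push(self, pooled_vectors):
        self._utts.append(pooled_vectors)

    def context(self):
        # Flatten cached utterances, oldest first, into one list of vectors.
        return [vec for utt in self._utts for vec in utt]
```

With K queries and a cache of N utterances, the context passed to the current utterance has at most K × N vectors regardless of how many frames the preceding utterances contained, which is the sense in which the representation is compact.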