Contextual-LAS (CLAS) has been shown to be effective in improving Automatic Speech Recognition (ASR) of rare words. It relies on phrase-level contextual modeling and attention-based relevance scoring without explicit contextual constraints, which leads to insufficient use of contextual information. In this work, we propose deep CLAS to make better use of contextual information. We introduce a bias loss that forces the model to focus on contextual information. The query of the bias attention is also enriched to improve the accuracy of the bias attention score. To obtain fine-grained contextual information, we replace phrase-level encoding with character-level encoding and encode contextual information with a Conformer rather than an LSTM. Moreover, we directly use the bias attention score to correct the output probability distribution of the model. Experiments are conducted on the public AISHELL-1 and AISHELL-NER datasets. On AISHELL-1, compared to CLAS baselines, deep CLAS obtains a 65.78% relative recall increase and a 53.49% relative F1-score increase in the named entity recognition scene.
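The last mechanism above (using the bias attention score to directly correct the output distribution) can be illustrated with a minimal sketch. This is not the paper's exact formulation: the shapes, the additive correction rule, and the renormalization step are assumptions for illustration only.

```python
import torch

# Hypothetical sketch: boost the decoder's output distribution at the
# vocabulary ids of contextual characters, weighted by bias attention.
torch.manual_seed(0)
vocab_size = 10

# Decoder state (query) and character-level context encodings (keys).
query = torch.randn(1, 8)                     # [batch, dim]
context_keys = torch.randn(3, 8)              # 3 contextual characters
context_vocab_ids = torch.tensor([2, 5, 7])   # vocab id of each context char

# Bias attention: relevance of each context character at this decode step.
scores = torch.softmax(query @ context_keys.T, dim=-1)  # [1, 3]

# Decoder's original output distribution.
logits = torch.randn(1, vocab_size)
probs = torch.softmax(logits, dim=-1)

# Scatter the attention mass onto matching vocab entries, then
# renormalize (assumed correction rule, not the paper's exact equation).
bias = torch.zeros(1, vocab_size)
bias[0, context_vocab_ids] = scores.squeeze(0)
corrected = (probs + bias) / (probs + bias).sum(dim=-1, keepdim=True)
```

Under this additive rule, probability mass shifts from non-contextual tokens toward the contextual characters, which is the intended effect of biasing toward rare words.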