Contextual-LAS (CLAS) has been shown to be effective in improving Automatic Speech Recognition (ASR) of rare words. It relies on phrase-level contextual modeling and attention-based relevance scoring without an explicit contextual constraint, which leads to insufficient use of contextual information. In this work, we propose deep CLAS to make better use of contextual information. We introduce a bias loss that forces the model to focus on contextual information. The query of the bias attention is also enriched to improve the accuracy of the bias attention score. To obtain fine-grained contextual information, we replace phrase-level encoding with character-level encoding and encode contextual information with a Conformer rather than an LSTM. Moreover, we directly use the bias attention score to correct the output probability distribution of the model. Experiments are conducted on the public AISHELL-1 and AISHELL-NER datasets. On AISHELL-1, compared to CLAS baselines, deep CLAS obtains a 65.78% relative improvement in recall and a 53.49% relative improvement in F1-score in the named entity recognition scene.
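The final correction step, using the bias attention score to adjust the model's output distribution, can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the additive-boost form, the `alpha` interpolation weight, and all function names are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bias_corrected_distribution(logits, query, bias_keys, bias_token_ids, alpha=0.5):
    """Sketch of bias-attention output correction (illustrative, not the
    paper's exact method).

    logits:         (vocab,) decoder logits for the current step
    query:          (d,) decoder state used as the bias-attention query
    bias_keys:      (num_bias, d) encoded contextual (bias) entries
    bias_token_ids: token id associated with each bias entry
    alpha:          assumed interpolation weight for the correction
    """
    # Attention distribution over the contextual bias entries.
    scores = softmax(bias_keys @ query)          # (num_bias,)
    # Base output distribution from the ASR decoder.
    p = softmax(logits)                          # (vocab,)
    # Add each entry's attention mass onto its token's probability.
    boost = np.zeros_like(p)
    for s, tok in zip(scores, bias_token_ids):
        boost[tok] += s
    p = p + alpha * boost
    return p / p.sum()                           # renormalize to a distribution
```

For example, with a uniform base distribution, a bias entry whose key matches the query pulls probability toward that entry's token, which is the intended effect of biasing the decoder toward rare contextual words.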