Improving Contextual Spelling Correction by External Acoustics Attention and Semantic Aware Data Augmentation

We previously proposed contextual spelling correction (CSC) to correct the output of end-to-end (E2E) automatic speech recognition (ASR) models using contextual information such as names and places. Although CSC achieves reasonable improvement on the biasing problem, two drawbacks limit further accuracy gains. First, because a text-only hypothesis carries limited information, or because the ASR model performs poorly on rare domains, the CSC model may fail to correct phrases with similar pronunciation, and may mishandle anti-context cases in which none of the biasing phrases is present in the utterance. Second, there is a discrepancy between the training and inference of CSC: the bias list is randomly selected in training, whereas in inference the ground-truth phrase may be much more similar to the other phrases in the list. To address these limitations, in this paper we propose an improved non-autoregressive (NAR) spelling correction model for contextual biasing in E2E neural transducer-based ASR systems, which improves the previous CSC model from two perspectives. First, we incorporate acoustic information through an external attention, alongside the text hypotheses, into CSC to better distinguish the target phrase from dissimilar or irrelevant phrases. Second, we design a semantic-aware data augmentation scheme for the training phase to reduce the mismatch between training and inference and further boost biasing accuracy. Experiments show that the improved method outperforms the baseline ASR+Biasing system by up to 20.3% relative name recall gain and achieves consistent improvement over the previous CSC method across different bias-list name coverage ratios.
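To illustrate the training/inference mismatch described above, the following is a minimal sketch of building a training bias list that mixes distractors similar to the ground-truth phrase with randomly drawn ones, rather than sampling the whole list at random. It is not the augmentation scheme of this paper: the abstract does not specify how similarity is measured, so character-level similarity from Python's difflib is used here purely as a stand-in, and the names build_bias_list, similarity, list_size, and num_hard are hypothetical.

```python
import random
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1] (assumption: character-level
    matching; the paper's semantic measure is not given in the abstract)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def build_bias_list(target, phrase_pool, list_size, num_hard, rng):
    """Return a shuffled bias list containing the target phrase, up to
    `num_hard` distractors most similar to it, and random fillers."""
    candidates = [p for p in phrase_pool if p != target]
    # Rank the remaining phrases from most to least similar to the target.
    ranked = sorted(candidates, key=lambda p: similarity(target, p), reverse=True)
    num_hard = max(0, min(num_hard, list_size - 1, len(ranked)))
    hard = ranked[:num_hard]
    rest = ranked[num_hard:]
    # Fill the remaining slots with randomly chosen (mostly dissimilar) phrases.
    num_easy = max(0, min(list_size - 1 - num_hard, len(rest)))
    easy = rng.sample(rest, num_easy)
    bias_list = [target] + hard + easy
    rng.shuffle(bias_list)  # avoid leaking the target's position
    return bias_list


if __name__ == "__main__":
    pool = ["John Smith", "Jon Smyth", "Joan Smith",
            "Alice Wong", "Bob Lee", "Carol Diaz"]
    print(build_bias_list("John Smith", pool, list_size=4, num_hard=2,
                          rng=random.Random(0)))
```

Setting num_hard to 0 reduces the sketch to purely random selection, which corresponds to the training condition that the abstract identifies as mismatched with inference.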
