Low-resource accented speech recognition is one of the major challenges that current ASR technology faces in practical applications. In this study, we propose a Conformer-based architecture, called Aformer, that leverages acoustic information from both large non-accented and limited accented training data. Specifically, the Aformer contains a general encoder and an accent encoder designed to extract complementary acoustic information. Moreover, we propose to train the Aformer in a multi-pass manner, and we investigate three cross-information fusion methods to effectively combine the information from the general and accent encoders. Experiments are conducted on both accented English and accented Mandarin ASR tasks. Results show that the proposed methods outperform a strong Conformer baseline, with relative word/character error rate reductions of 10.2% to 24.5% on six in-domain and out-of-domain accented test sets.
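The reported gains are relative, not absolute, error rate reductions. As a minimal sketch of how such a figure is typically computed, the helper below (the name `relative_reduction` and the example error rates are hypothetical and are not taken from the paper) divides the absolute improvement by the baseline error rate:

```python
def relative_reduction(baseline_err: float, system_err: float) -> float:
    """Return the relative error rate reduction of a system over a baseline, in percent."""
    if baseline_err <= 0:
        raise ValueError("baseline error rate must be positive")
    return 100.0 * (baseline_err - system_err) / baseline_err


# Hypothetical error rates for illustration only; these are NOT results from the paper.
baseline_wer = 20.0  # baseline WER in percent (assumed)
system_wer = 16.0    # proposed-system WER in percent (assumed)

print(round(relative_reduction(baseline_wer, system_wer), 1))
```

Under these assumed numbers, an absolute drop of 4 points from a 20% baseline corresponds to a 20% relative reduction, which is why relative figures can look larger than the absolute change in WER or CER.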