The cross-domain performance of automatic speech recognition (ASR) can degrade severely when the training and test distributions are mismatched. Because the target domain usually lacks labeled data and domain shift occurs at both the acoustic and linguistic levels, unsupervised domain adaptation (UDA) for ASR is challenging. Previous work has shown that self-supervised learning (SSL) or pseudo-labeling (PL) is effective for UDA because it exploits the self-supervision available in unlabeled data. However, these self-supervised signals themselves degrade under mismatched domain distributions, an issue that previous work fails to address. This work presents a systematic UDA framework that fully utilizes unlabeled data through self-supervision within the pre-training and fine-tuning paradigm. On the one hand, we apply continued pre-training and data replay to mitigate the domain mismatch of the SSL pre-trained model. On the other hand, we propose a domain-adaptive fine-tuning approach based on PL with three modifications: first, a dual-branch PL method that decreases sensitivity to erroneous pseudo-labels; second, an uncertainty-aware confidence filtering strategy that improves pseudo-label correctness; and third, a two-step PL approach that incorporates target-domain linguistic knowledge and thereby generates more accurate target-domain pseudo-labels. Experimental results on various cross-domain scenarios demonstrate that the proposed approach effectively boosts cross-domain performance and significantly outperforms previous approaches.
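To make the idea of confidence filtering for pseudo-labels concrete, the following is a minimal, generic sketch and not the filtering strategy of this work, whose details the abstract does not give. It assumes that each unlabeled utterance has been decoded several times (for example with dropout active), that each pass yields a sequence-level confidence in [0, 1], and that a pseudo-label is kept only if the mean confidence is high and the spread across passes is small. The function name `filter_pseudo_labels`, the input format, and both thresholds are hypothetical.

```python
from statistics import mean, pstdev


def filter_pseudo_labels(hypotheses, conf_threshold=0.9, max_std=0.05):
    """Keep pseudo-labels that look reliable (illustrative assumption only).

    hypotheses: list of dicts with the hypothetical keys
        "utt_id"      - utterance identifier
        "text"        - pseudo-label transcript
        "confidences" - sequence-level confidences in [0, 1], one per
                        stochastic forward pass
    """
    kept = []
    for hyp in hypotheses:
        scores = hyp["confidences"]
        avg = mean(scores)       # average confidence across passes
        spread = pstdev(scores)  # disagreement across passes, used as an uncertainty proxy
        # Keep the utterance only if it is both confident and stable.
        if avg >= conf_threshold and spread <= max_std:
            kept.append((hyp["utt_id"], hyp["text"]))
    return kept


if __name__ == "__main__":
    demo = [
        {"utt_id": "a", "text": "hello", "confidences": [0.95, 0.96, 0.94]},
        {"utt_id": "b", "text": "world", "confidences": [1.0, 0.85, 1.0]},
        {"utt_id": "c", "text": "noise", "confidences": [0.5, 0.5, 0.5]},
    ]
    print(filter_pseudo_labels(demo))
```

In this toy input, utterance `b` has a high mean confidence but is rejected because its passes disagree, which is the case a plain confidence threshold would miss.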