The performance of automatic speech recognition models often degrades on domains not covered by the training data. Domain adaptation can address this issue, assuming target-domain data are available in the target language. However, this assumption does not hold in many real-world applications. To make domain adaptation more widely applicable, we address the problem of zero-shot domain adaptation (ZSDA), where target-domain data are unavailable in the target language. Instead, we transfer target-domain knowledge from another source language in which target-domain data are more accessible. To this end, we first perform cross-lingual pre-training (XLPT) to share domain knowledge across languages, and then fine-tune on the target language to build the final model. One challenge with this approach is that the pre-trained knowledge can be forgotten during fine-tuning, resulting in sub-optimal adaptation performance. To address this issue, we propose transliterated ZSDA, which keeps pre-training and fine-tuning labels consistent and thereby maximizes preservation of the pre-trained knowledge. Experimental results show that transliterated ZSDA reduces the word error rate by 9.2% relative to a wav2vec 2.0 baseline. Moreover, transliterated ZSDA consistently outperforms self-supervised ZSDA and performs on par with supervised ZSDA, demonstrating the superiority of transliteration-based pre-training labels.
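The key idea of keeping pre-training and fine-tuning labels consistent can be illustrated with a minimal sketch. The mapping table and transcripts below are hypothetical examples, not the paper's actual transliteration scheme: source-language transcripts are transliterated into the target language's grapheme set, so XLPT and target-language fine-tuning share a single label vocabulary.

```python
# Hypothetical sketch (not the authors' code): transliterate source-language
# transcripts into the target language's graphemes so that cross-lingual
# pre-training and target-language fine-tuning use one consistent label set.

# Toy grapheme mapping, assumed purely for illustration.
TRANSLIT = {"а": "a", "б": "b", "в": "v", "г": "g", "д": "d"}

def transliterate(text: str) -> str:
    """Map each source grapheme to the target grapheme set; pass through
    characters that have no mapping."""
    return "".join(TRANSLIT.get(ch, ch) for ch in text)

# Source-language, target-domain transcripts (toy data) become target-script
# pre-training labels, matching the fine-tuning label space exactly.
source_transcripts = ["да", "ба"]
pretrain_labels = [transliterate(t) for t in source_transcripts]
print(pretrain_labels)  # ['da', 'ba']
```

Because both stages now emit labels over the same grapheme inventory, fine-tuning updates the same output units that pre-training shaped, which is what limits forgetting of the pre-trained knowledge.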