In this paper, we investigate domain adaptation for low-resource Automatic Speech Recognition (ASR) on target-domain data when an ASR model well trained on a large dataset is available. We argue that, in the encoder-decoder framework, the decoder of the well-trained ASR model is largely tuned towards the source domain, which hurts the performance of target-domain models in vanilla transfer learning. The encoder layers of the well-trained ASR model, on the other hand, mostly capture acoustic characteristics. We therefore propose to use the embeddings tapped from these encoder layers as features for a downstream Conformer target-domain model and show that they provide significant improvements. We conduct ablation studies on which encoder layer is best to tap for the embeddings, and on the effect of freezing versus updating the well-trained ASR model's encoder layers. We further show that applying Spectral Augmentation (SpecAug) to the proposed features, in addition to the default SpecAug on the input spectral features, further improves target-domain performance. With LibriSpeech-100-clean as the target-domain data and a model trained on SPGI-5000 as the well-trained model, we obtain a 30% relative improvement over the baseline. Similarly, with WSJ as the target-domain data and a model trained on LibriSpeech-960 as the well-trained model, we obtain a 50% relative improvement over the baseline.
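To make the idea of applying SpecAug to the tapped encoder embeddings concrete, the following is a minimal sketch of masking-style augmentation on a sequence of embedding frames. It is an illustration under stated assumptions, not the configuration used in this work: the function name `mask_embeddings`, the mask widths, and the choice of one time mask and one dimension mask are all hypothetical, and the time-warping component of SpecAugment is omitted. Plain Python lists stand in for the tensors a real toolkit would use.

```python
import random


def mask_embeddings(features, max_time_width=2, max_dim_width=2, seed=None):
    """Zero out one random span of frames and one random span of dimensions.

    `features` is a list of T frames, each a list of D floats (for example,
    embeddings tapped from an encoder layer). A masked copy is returned and
    the input is left untouched. All widths are illustrative assumptions.
    """
    rng = random.Random(seed)
    num_frames = len(features)
    num_dims = len(features[0]) if num_frames else 0
    masked = [list(frame) for frame in features]
    if num_frames == 0 or num_dims == 0:
        return masked

    # Time mask: choose a width in [0, max_time_width] and a start frame.
    t_width = rng.randint(0, min(max_time_width, num_frames))
    t_start = rng.randint(0, num_frames - t_width)
    for t in range(t_start, t_start + t_width):
        masked[t] = [0.0] * num_dims

    # Dimension mask: the analogue of frequency masking on spectrograms,
    # applied here to embedding dimensions instead of frequency bins.
    d_width = rng.randint(0, min(max_dim_width, num_dims))
    d_start = rng.randint(0, num_dims - d_width)
    for frame in masked:
        for d in range(d_start, d_start + d_width):
            frame[d] = 0.0
    return masked


if __name__ == "__main__":
    # Six frames of four-dimensional dummy embeddings.
    dummy = [[1.0, 1.0, 1.0, 1.0] for _ in range(6)]
    for row in mask_embeddings(dummy, seed=0):
        print(row)
```

In a full pipeline, masking of this kind would be applied only during training, between the tapped encoder layer and the downstream target-domain model, while the default SpecAug on the input spectral features stays in place.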