Although automatic speech recognition (ASR) models have advanced significantly with the introduction of unsupervised or self-supervised training techniques, these improvements remain limited to a subset of languages and speakers. Transfer learning enables the adaptation of large-scale multilingual models not only to low-resource languages but also to more specific speaker groups. However, fine-tuning on data from a new domain is usually accompanied by a decrease in performance on the original domain. In our experiments, we therefore examine how closely the performance of large-scale ASR models can be approximated on a smaller domain, using our own dataset of German Senior Voice Commands (SVC-de), and how much of the general speech recognition performance can be preserved by selectively freezing parts of the model during training. To further increase the robustness of the ASR model to vocabulary and speakers outside the fine-tuned domain, we apply Experience Replay for continual learning. By adding only a fraction of data from the original domain, we reach word error rates (WERs) below 5% on the new domain while stabilizing general speech recognition performance at acceptable WERs.
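
As an illustration of the two quantities the abstract relies on, the following is a minimal sketch, not the authors' implementation. It assumes WER is computed as the word-level edit distance divided by the reference length, and that Experience Replay amounts to mixing a sample of original-domain examples into the new-domain training set. The function names, the `replay_fraction` parameter (defined here relative to the size of the new-domain set) and the example phrases are hypothetical.

```python
import random


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def mix_with_replay(new_domain, original_domain, replay_fraction=0.1, seed=0):
    """Return a shuffled training set: all new-domain examples plus a
    random sample of original-domain examples.

    Assumption: the replay sample size is replay_fraction times the
    number of new-domain examples, capped at the original-domain size.
    """
    rng = random.Random(seed)
    n_replay = min(len(original_domain),
                   round(replay_fraction * len(new_domain)))
    replay = rng.sample(list(original_domain), n_replay)
    mixed = list(new_domain) + replay
    rng.shuffle(mixed)
    return mixed


# Hypothetical usage: one substituted word out of three reference words.
wer = word_error_rate("licht an bitte", "licht aus bitte")

# Hypothetical usage: 20 new-domain items plus a 10% replay sample.
new_items = [f"svc_{i}" for i in range(20)]
old_items = [f"general_{i}" for i in range(100)]
train_set = mix_with_replay(new_items, old_items, replay_fraction=0.1)
```

Defining the replay size relative to the new-domain set is only one possible choice. A per-batch mixing ratio would serve the same purpose of keeping original-domain examples present during fine-tuning.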