While automatic speech recognition (ASR) models have advanced significantly with the introduction of unsupervised and self-supervised training techniques, these improvements remain limited to a subset of languages and speakers. Transfer learning enables the adaptation of large-scale multilingual models not only to low-resource languages but also to more specific speaker groups. However, fine-tuning on data from a new domain is usually accompanied by a decrease in performance on the original domain. In our experiments, we therefore examine how closely the performance of large-scale ASR models can be approximated on a smaller domain, using our own dataset of German Senior Voice Commands (SVC-de), and how much general speech recognition performance can be preserved by selectively freezing parts of the model during training. To further increase the robustness of the ASR model to vocabulary and speakers outside the fine-tuned domain, we apply Experience Replay for continual learning. By adding only a fraction of data from the original domain, we reach word error rates (WERs) below 5\% on the new domain while stabilizing general speech recognition performance at acceptable WERs.
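The data-mixing idea behind Experience Replay can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `build_replay_mix`, the parameter `replay_fraction`, and the interpretation of "a fraction of data from the original domain" as a random share of the original-domain pool are all assumptions made for illustration. Samples are represented as plain Python tuples rather than audio.

```python
import random


def build_replay_mix(new_domain, original_domain, replay_fraction=0.1, seed=0):
    """Combine all new-domain samples with a random subset of original-domain samples.

    Hypothetical helper: `replay_fraction` is assumed to be the share of the
    original-domain pool that is replayed during fine-tuning.
    """
    if not 0.0 <= replay_fraction <= 1.0:
        raise ValueError("replay_fraction must be in [0, 1]")

    rng = random.Random(seed)  # local generator keeps the mix reproducible

    # Number of original-domain samples to replay alongside the new data.
    n_replay = round(replay_fraction * len(original_domain))
    replay = rng.sample(list(original_domain), n_replay)

    # Fine-tuning set: every new-domain sample plus the replayed subset,
    # shuffled so that both domains are interleaved within an epoch.
    mixed = list(new_domain) + replay
    rng.shuffle(mixed)
    return mixed


# Toy stand-ins for (domain, utterance id) pairs.
svc_samples = [("svc-de", i) for i in range(20)]
general_samples = [("general", i) for i in range(50)]

train_set = build_replay_mix(svc_samples, general_samples, replay_fraction=0.1)
```

In an actual fine-tuning setup, the mixed list would feed the training data loader, and the replay share would be tuned to balance WER on the new domain against WER on general speech.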