Many speakers of low-resource languages regularly code-switch between their own languages and other regional languages or English, yet datasets of code-switched speech are too small to train bespoke acoustic models from scratch or to perform language model rescoring. Here we propose finetuning self-supervised speech representations, such as wav2vec 2.0 XLSR, to recognize code-switched speech. We find that finetuning self-supervised multilingual representations and augmenting them with n-gram language models trained on transcripts reduces word error rate by up to 20% absolute compared to baseline hybrid models trained from scratch on code-switched data. Our findings suggest that, when training data is limited, finetuning self-supervised representations is a viable and better-performing solution.
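Because the headline result is stated as an absolute reduction in word error rate (WER), it may help to make the metric concrete. The following is a minimal sketch, assuming the standard definition of WER as the word-level edit distance divided by the number of reference words; it is not the evaluation code used in this work. The function names and the numbers in the usage note are illustrative assumptions and do not come from the reported experiments.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def absolute_reduction(baseline_wer: float, new_wer: float) -> float:
    """Difference in WER, in the same units as the inputs."""
    return baseline_wer - new_wer


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Difference in WER as a fraction of the baseline WER."""
    if baseline_wer == 0:
        raise ValueError("baseline WER must be non-zero")
    return (baseline_wer - new_wer) / baseline_wer
```

As a purely hypothetical illustration, a baseline WER of 0.50 and a finetuned WER of 0.30 give `absolute_reduction(0.50, 0.30)` of about 0.20 (20 percentage points) and `relative_reduction(0.50, 0.30)` of about 0.40. The distinction matters when comparing results across systems whose baseline error rates differ.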