Recent models such as XLS-R and Whisper have made multilingual speech technologies more accessible by pre-training on audio from around 100 spoken languages each. However, there are thousands of spoken languages worldwide, and adapting to new languages is an important problem. In this work, we aim to understand which model adapts better to languages unseen during pre-training. We fine-tune both models on 13 unseen languages and 18 seen languages. Our results show that the number of hours seen per language and per language family during pre-training is predictive of how the models compare, despite the significant differences in their pre-training methods.