Multilingual self-supervised speech representation models have greatly improved speech recognition performance for low-resource languages, and compressing these large models has become a crucial prerequisite for their industrial application. In this paper, we propose DistilXLSR, a distilled cross-lingual speech representation model. By randomly shuffling the phonemes of existing speech, we reduce its linguistic information and distill cross-lingual models using only English data. We also design a layer-jumping initialization method to fully leverage the teacher's pre-trained weights. Experiments on two kinds of teacher models and 15 low-resource languages show that our method reduces the parameter count by 50% while maintaining cross-lingual representation ability. The experiments also indicate that the method generalizes across languages and teacher models and has the potential to improve the cross-lingual performance of English pre-trained models.
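The two techniques named in the abstract, phoneme shuffling and layer-jumping initialization, can be illustrated with a minimal sketch. The abstract does not give implementation details, so everything below is an assumption made for illustration: the function names are hypothetical, phoneme boundaries are assumed to be available as sample-index pairs (for example from a forced aligner), and the evenly spaced layer mapping is one plausible reading of "layer-jumping", not necessarily the scheme used for DistilXLSR.

```python
import random
from typing import List, Optional, Sequence, Tuple


def shuffle_phoneme_segments(
    samples: Sequence[float],
    boundaries: Sequence[Tuple[int, int]],
    seed: Optional[int] = None,
) -> List[float]:
    """Reorder phoneme-level chunks of a waveform at random.

    `boundaries` holds (start, end) sample indices, one pair per phoneme.
    Assumption: the pairs are non-overlapping and come from an external
    alignment step that is not shown here.
    """
    rng = random.Random(seed)
    # Cut the waveform into one chunk per phoneme.
    segments = [list(samples[start:end]) for start, end in boundaries]
    # Shuffling the chunk order keeps the acoustics of each phoneme
    # but breaks the original word and sentence structure.
    rng.shuffle(segments)
    return [value for segment in segments for value in segment]


def layer_jumping_indices(num_teacher_layers: int, num_student_layers: int) -> List[int]:
    """Pick evenly spaced teacher layers to initialise the student layers.

    Assumption: student layer i copies teacher layer stride * (i + 1) - 1
    (0-based), so a 12-layer student of a 24-layer teacher starts from
    every second teacher layer.
    """
    if not 0 < num_student_layers <= num_teacher_layers:
        raise ValueError("student must have between 1 and num_teacher_layers layers")
    stride = num_teacher_layers // num_student_layers
    return [stride * (i + 1) - 1 for i in range(num_student_layers)]
```

In a real pipeline the index list from `layer_jumping_indices` would be used to copy the corresponding teacher Transformer blocks into the student before distillation starts, and the shuffled waveform would replace the original English utterance as distillation input.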