Multilingual self-supervised learning (SSL) has often lagged behind state-of-the-art (SOTA) methods because of the expense and complexity of handling many languages. This further harms the reproducibility of SSL, which is already limited to a few research groups by its resource requirements. We show that more powerful techniques can actually lead to more efficient pre-training, opening SSL to more research groups. We propose WavLabLM, which extends WavLM's joint prediction and denoising to 40k hours of data across 136 languages. To build WavLabLM, we devise a novel multi-stage pre-training method designed to address the language imbalance of multilingual data. WavLabLM achieves performance comparable to XLS-R on ML-SUPERB with less than 10% of the training data, making SSL feasible with academic compute. We show that further efficiency can be achieved with a vanilla HuBERT Base model, which maintains 94% of XLS-R's performance with only 3% of the data, 4 GPUs, and limited trials. We open-source all code and models in ESPnet.
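The abstract does not spell out how the multi-stage pre-training handles language imbalance. As a minimal sketch of one way such imbalance could be reduced, assuming a later training stage draws on a subset in which each language is capped at a fixed number of hours, the selection step might look like the following. The function `balanced_subset`, its parameter `max_hours_per_language`, and the `(utt_id, language, duration_seconds)` tuple layout are hypothetical names chosen for illustration; they are not taken from the paper or from ESPnet.

```python
from collections import defaultdict


def balanced_subset(utterances, max_hours_per_language):
    """Keep utterances per language until a per-language hour cap is reached.

    utterances: iterable of (utt_id, language, duration_seconds) tuples.
    Returns the kept utterances and the seconds kept per language.
    This is an illustrative assumption, not the authors' procedure.
    """
    cap_seconds = max_hours_per_language * 3600.0
    used = defaultdict(float)  # seconds kept so far, per language
    kept = []
    for utt_id, lang, dur in utterances:
        # Skip an utterance if adding it would exceed the language's cap.
        if used[lang] + dur <= cap_seconds:
            used[lang] += dur
            kept.append((utt_id, lang, dur))
    return kept, dict(used)


# Toy data: a high-resource language and a low-resource one.
data = [
    ("en_0", "eng", 3600.0),
    ("en_1", "eng", 3600.0),
    ("en_2", "eng", 3600.0),
    ("sw_0", "swa", 1800.0),
]
subset, seconds_kept = balanced_subset(data, max_hours_per_language=2)
```

In this toy case the high-resource language is truncated at the cap while the low-resource language is kept in full, which narrows the gap between them without discarding any low-resource audio.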