Multilingual self-supervised learning (SSL) has often lagged behind state-of-the-art (SOTA) methods because of the expense and complexity of handling many languages. This further harms the reproducibility of SSL, which is already limited to a few research groups by its resource demands. We show that more powerful techniques can actually lead to more efficient pre-training, opening SSL to more research groups. We propose WavLabLM, which extends WavLM's joint prediction and denoising to 40k hours of data across 136 languages. To build WavLabLM, we devise a novel multi-stage pre-training method designed to address the language imbalance of multilingual data. WavLabLM achieves performance comparable to XLS-R on ML-SUPERB with less than 10% of the training data, making SSL realizable with academic compute. We show that further efficiency can be achieved with a vanilla HuBERT Base model, which maintains 94% of XLS-R's performance with only 3% of the data, 4 GPUs, and limited trials. We open-source all code and models in ESPnet.