Self-supervised learning (SSL) has driven unprecedented improvements in many domains, including computer vision and natural language processing. Speech processing has benefited drastically from SSL, as most current tasks in the field are now approached with pre-trained models. This work introduces LeBenchmark 2.0, an open-source framework for assessing and building SSL-equipped French speech technologies. It includes documented, large-scale, and heterogeneous corpora with up to 14,000 hours of speech; ten pre-trained SSL wav2vec 2.0 models, containing from 26 million to one billion learnable parameters, shared with the community; and an evaluation protocol made of six downstream tasks to complement existing benchmarks. LeBenchmark 2.0 also presents unique perspectives on pre-trained SSL models for speech, with an investigation of frozen versus fine-tuned downstream models and of task-agnostic versus task-specific pre-trained models, as well as a discussion of the carbon footprint of large-scale model training. Overall, the newly introduced models trained on 14,000 hours of French speech outperform multilingual and previous LeBenchmark SSL models across the benchmark, but they also required up to four times more energy for pre-training.