Self-supervised learning (SSL) has driven unprecedented improvements in many domains, including computer vision and natural language processing. Speech processing has benefited drastically from SSL, as most current tasks in the domain are now approached with pre-trained models. This work introduces LeBenchmark 2.0, an open-source framework for assessing and building SSL-equipped French speech technologies. It includes documented, large-scale and heterogeneous corpora with up to 14,000 hours of speech; ten pre-trained SSL wav2vec 2.0 models, containing from 26 million to one billion learnable parameters, shared with the community; and an evaluation protocol made of six downstream tasks to complement existing benchmarks. LeBenchmark 2.0 also presents unique perspectives on pre-trained SSL models for speech, with an investigation of frozen versus fine-tuned downstream models and of task-agnostic versus task-specific pre-trained models, as well as a discussion of the carbon footprint of large-scale model training.