In this study, we introduce YODAS (YouTube-Oriented Dataset for Audio and Speech), a large-scale, multilingual dataset that currently comprises over 500k hours of speech data in more than 100 languages, drawn from both labeled and unlabeled YouTube speech data. The labeled subsets, which include manual or automatic subtitles, support supervised model training, while the unlabeled subsets are suited to self-supervised learning. YODAS is the first publicly available dataset of its scale, and it is distributed under a Creative Commons license. We describe the collection methodology used for YODAS, which can inform the construction of other large-scale speech datasets. We then provide a comprehensive analysis of the speech and text contained in the dataset. Finally, we describe speech recognition baselines on the top-15 languages.