In this paper, we introduce Libriheavy, a large-scale ASR corpus consisting of 50,000 hours of read English speech derived from LibriVox. To the best of our knowledge, Libriheavy is the largest freely available corpus of speech with supervisions. Unlike other open-source datasets that provide only normalized transcriptions, Libriheavy contains richer information such as punctuation, casing, and text context, which brings more flexibility for system building. Specifically, we propose a general and efficient pipeline to locate, align, and segment the audio in the previously published Librilight against its corresponding texts. Like Librilight, Libriheavy has three training subsets, small, medium, and large, of 500, 5,000, and 50,000 hours respectively. We also extract dev and test evaluation sets from the aligned audio and guarantee that their speakers and books do not overlap with those in the training sets. Baseline systems are built on the popular CTC-Attention and transducer models. Additionally, we open-source our dataset creation pipeline, which can also be used for other audio alignment tasks.