In this paper, we introduce Libriheavy, a large-scale ASR corpus consisting of 50,000 hours of read English speech derived from LibriVox. To the best of our knowledge, Libriheavy is the largest freely available corpus of speech with supervisions. Unlike other open-source datasets that provide only normalized transcriptions, Libriheavy contains richer information such as punctuation, casing, and text context, which brings more flexibility for system building. Specifically, we propose a general and efficient pipeline to locate, align, and segment the audio in the previously published Librilight against its corresponding texts. Like Librilight, Libriheavy has three training subsets, small, medium, and large, of 500, 5,000, and 50,000 hours respectively. We also extract dev and test evaluation sets from the aligned audio and guarantee that they share no speakers or books with the training sets. Baseline systems are built on the popular CTC-Attention and transducer models. Additionally, we open-source our dataset creation pipeline, which can also be used for other audio alignment tasks.