High-quality transcription data is crucial for training automatic speech recognition (ASR) systems. However, existing industry-level data collection pipelines are expensive for researchers, while the quality of crowdsourced transcription is low. In this paper, we propose a reliable method for collecting speech transcriptions. We introduce two mechanisms to improve transcription quality: confidence-estimation-based reprocessing at the labeling stage and automatic word error correction at the post-labeling stage. We collect and release LibriCrowd, a large-scale crowdsourced dataset of audio transcriptions on 100 hours of English speech. Experiments show that the transcription word error rate (WER) is reduced by over 50%. We further investigate the impact of transcription errors on ASR model performance and find a strong correlation. The improvement in transcription quality provides over 10% relative WER reduction for ASR models. We release the dataset and code to benefit the research community.
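For readers unfamiliar with the metric, the two quantities quoted above (transcription WER and relative WER reduction) can be illustrated with a minimal Python sketch. It assumes the standard definition of WER as word-level edit distance divided by the number of reference words, with whitespace tokenization and no text normalization. The function names `word_error_rate` and `relative_wer_reduction` are hypothetical and are not taken from the released code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: plain whitespace tokenization, no casing or punctuation
    normalization. A real evaluation pipeline may normalize text first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Fraction by which WER drops relative to the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - improved_wer) / baseline_wer


if __name__ == "__main__":
    # Made-up sentences for illustration only; not data from LibriCrowd.
    ref = "the cat sat on the mat"
    hyp = "the cat sat on mat"
    print(word_error_rate(ref, hyp))
    print(relative_wer_reduction(0.10, 0.05))
```

Under this definition, a "reduction by over 50%" means the relative reduction exceeds 0.5, for example a hypothetical drop from 10% to below 5% WER.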