The use of extensive open-source text data has recently driven significant gains in the performance of text-based large language models (LLMs). In the speech technology community, however, the use of large-scale in-the-wild speech data remains constrained. One reason is that much of the publicly available speech data is compromised by background noise, overlapping speech, missing segmentation information, missing speaker labels, and incomplete transcriptions, all of which greatly limit its usefulness. Human annotation of speech data, meanwhile, is both time-consuming and costly. To address this issue, we introduce AutoPrep, an automatic preprocessing framework for in-the-wild speech data that enhances speech quality, generates speaker labels, and produces transcriptions automatically. AutoPrep comprises six components: speech enhancement, speech segmentation, speaker clustering, target speech extraction, quality filtering, and automatic speech recognition. Experiments on the open-source WenetSpeech corpus and our self-collected AutoPrepWild corpus show that AutoPrep produces preprocessed data whose DNSMOS and PDNSMOS scores are comparable to those of several open-source text-to-speech (TTS) datasets. The corresponding TTS system achieves an in-domain speaker similarity of up to 0.68.
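The six components listed in the abstract can be pictured as a chain of stages applied to one recording. The following is a minimal sketch only, not the authors' implementation: it assumes the components run in the order in which the abstract lists them, and every name in it (`Segment`, `Stages`, `run_pipeline`, `toy_stages`, `min_quality`) is hypothetical. The stages are passed in as plain callables, and the toy versions are trivial stand-ins for real models.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Segment:
    """One speech segment and the labels attached to it (hypothetical type)."""
    audio: List[float]
    start: float
    end: float
    speaker: Optional[str] = None
    quality: Optional[float] = None
    text: Optional[str] = None


@dataclass
class Stages:
    """The six components as pluggable callables (all names are assumptions)."""
    enhance: Callable[[List[float]], List[float]]    # speech enhancement
    segment: Callable[[List[float]], List[Segment]]  # speech segmentation
    cluster: Callable[[List[Segment]], List[str]]    # speaker clustering
    extract: Callable[[Segment], List[float]]        # target speech extraction
    score: Callable[[Segment], float]                # quality score for filtering
    transcribe: Callable[[Segment], str]             # automatic speech recognition


def run_pipeline(audio: List[float], stages: Stages, min_quality: float) -> List[Segment]:
    """Run the stages in the assumed order and keep segments that pass the filter."""
    enhanced = stages.enhance(audio)
    segments = stages.segment(enhanced)
    # One speaker label per segment, in segment order.
    for seg, label in zip(segments, stages.cluster(segments)):
        seg.speaker = label
    kept = []
    for seg in segments:
        seg.audio = stages.extract(seg)
        seg.quality = stages.score(seg)
        # Assumption: filtering is a simple threshold on a scalar score.
        if seg.quality >= min_quality:
            seg.text = stages.transcribe(seg)
            kept.append(seg)
    return kept


def toy_stages() -> Stages:
    """Trivial stand-ins so the sketch runs end to end; not real models."""
    return Stages(
        enhance=lambda a: a,
        segment=lambda a: [Segment(a[:2], 0.0, 1.0), Segment(a[2:], 1.0, 2.0)],
        cluster=lambda segs: [f"spk{i}" for i, _ in enumerate(segs)],
        extract=lambda s: s.audio,
        score=lambda s: max((abs(x) for x in s.audio), default=0.0),
        transcribe=lambda s: "<transcript>",
    )


result = run_pipeline([0.1, 0.2, 0.9, 0.8], toy_stages(), min_quality=0.5)
```

With the toy stages, the first segment scores below the threshold and is dropped, so `result` holds only the second segment together with its speaker label, quality score, and placeholder transcript. Passing the stages in as callables keeps the control flow independent of whichever enhancement, clustering, or recognition models are actually used.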