Speech generation models have recently made significant progress by using large-scale training data. However, the research community still struggles to produce highly spontaneous, human-like speech because large-scale, diverse, and spontaneous speech data is lacking. This paper presents \textit{Emilia}, the first multilingual speech generation dataset built from in-the-wild speech data, and Emilia-Pipe, the first open-source preprocessing pipeline designed to transform in-the-wild speech data into high-quality, annotated training data for speech generation. Emilia initially comprises over 101k hours of speech in six languages and covers diverse speaking styles. To facilitate scaling Emilia up, the open-source Emilia-Pipe can turn one hour of raw speech into training-ready data in a few minutes, which enables the research community to collaborate on large-scale speech generation research. Experimental results validate the effectiveness of Emilia. Demos are available at: https://emilia-dataset.github.io/Emilia-Demo-Page/.