Text-to-speech models trained on large-scale datasets have demonstrated impressive in-context learning capabilities and naturalness. However, controlling speaker identity and style in these models typically requires conditioning on reference speech recordings, which limits creative applications. Alternatively, natural language prompting of speaker identity and style has demonstrated promising results and provides an intuitive method of control. However, its reliance on human-labeled descriptions prevents scaling to large datasets. Our work bridges the gap between these two approaches. We propose a scalable method for labeling various aspects of speaker identity, style, and recording conditions. We then apply this method to a 45k-hour dataset, which we use to train a speech language model. Furthermore, we propose simple methods for increasing audio fidelity, significantly outperforming recent work despite relying entirely on found data. Our results demonstrate high-fidelity speech generation in a diverse range of accents, prosodic styles, channel conditions, and acoustic conditions, all accomplished with a single model and intuitive natural language conditioning. Audio samples can be heard at https://text-description-to-speech.com/.