We propose PromptTTS++, a prompt-based text-to-speech (TTS) synthesis system that allows control over speaker identity using natural language descriptions. To control speaker identity within the prompt-based TTS framework, we introduce the concept of a speaker prompt, which describes voice characteristics (e.g., gender-neutral, young, old, and muffled) designed to be approximately independent of speaking style. Since there is no large-scale dataset containing speaker prompts, we first construct a dataset based on the LibriTTS-R corpus with manually annotated speaker prompts. We then employ a diffusion-based acoustic model with mixture density networks to model diverse speaker factors in the training data. Unlike previous studies that rely on style prompts describing only limited aspects of speaker individuality, such as pitch, speaking speed, and energy, our method utilizes an additional speaker prompt to effectively learn the mapping from natural language descriptions to the acoustic features of diverse speakers. Our subjective evaluation results show that the proposed method controls speaker characteristics better than methods without the speaker prompt. Audio samples are available at https://reppy4620.github.io/demo.promptttspp/.
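As a rough illustration of why a mixture density network suits the one-to-many mapping from a description to diverse speakers, the sketch below evaluates and samples a mixture of diagonal Gaussians. It is not the system's implementation: the abstract does not give the parameterization, so the diagonal-Gaussian form, the idea that the mixture is over a speaker-representation vector, and all function and argument names (`mdn_log_likelihood`, `mdn_sample`, `log_weights`, `means`, `log_stds`) are assumptions made for this example. In a real model these parameters would presumably be predicted by a network from the prompt embedding.

```python
import math
import random


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def mdn_log_likelihood(x, log_weights, means, log_stds):
    """Log-density of vector x under a mixture of diagonal Gaussians.

    log_weights: K unnormalized mixture logits (hypothetical network output)
    means, log_stds: K lists, each with len(x) entries
    """
    log_norm = log_sum_exp(log_weights)  # normalizes the mixture weights
    per_component = []
    for logit, mu, log_sd in zip(log_weights, means, log_stds):
        log_prob = logit - log_norm  # log of the normalized mixture weight
        for xi, m, ls in zip(x, mu, log_sd):
            z = (xi - m) / math.exp(ls)
            log_prob += -0.5 * z * z - ls - 0.5 * math.log(2.0 * math.pi)
        per_component.append(log_prob)
    return log_sum_exp(per_component)


def mdn_sample(log_weights, means, log_stds, rng):
    """Draw one vector: pick a component by weight, then sample its Gaussian."""
    peak = max(log_weights)
    weights = [math.exp(w - peak) for w in log_weights]
    k = rng.choices(range(len(weights)), weights=weights)[0]
    return [rng.gauss(m, math.exp(ls)) for m, ls in zip(means[k], log_stds[k])]
```

Training such a model would minimize the negative of `mdn_log_likelihood` over the data. Sampling repeatedly for the same description can yield different vectors, which is the property that lets a single description correspond to many plausible speakers.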