This paper proposes a speech synthesis system that lets users specify and control the acoustic characteristics of the synthesized speaker through text prompts describing the speaker's traits. Unlike previous approaches, our method constructs prompts from listener impressions, which are easier to collect and align more naturally with everyday descriptions of speaker traits. We adopt Low-Rank Adaptation (LoRA) to efficiently adapt a pre-trained language model so that it can extract speaker-related traits from the prompt text. In addition, unlike other prompt-driven text-to-speech (TTS) systems, we separate the prompt-to-speaker module from the multi-speaker TTS system, which improves flexibility and compatibility with various pre-trained multi-speaker TTS systems. For the prompt-to-speaker characteristic module, we also compare a discriminative method with a flow-matching-based generative method, and find that combining the two helps the system both capture speaker-related information from prompts better and generate speech with higher fidelity.
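As a rough illustration of the LoRA idea mentioned above, the following minimal NumPy sketch shows a frozen linear layer with a trainable low-rank update. It is not the authors' implementation: the class name `LoRALinear`, the rank, the scaling factor `alpha`, and the layer sizes are all assumptions chosen for illustration, and the abstract does not say which language model or which layers are adapted.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (hypothetical sketch)."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight  # pre-trained weight, kept frozen
        # Low-rank factors: A starts small and random, B starts at zero,
        # so the adapted layer initially matches the pre-trained layer.
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))
        self.B = np.zeros((out_dim, rank))
        self.scale = alpha / rank  # assumed scaling convention

    def __call__(self, x):
        # y = x W^T + scale * x A^T B^T, i.e. W is effectively W + scale * B A
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def num_trainable(self):
        # Only A and B would be updated during adaptation.
        return self.A.size + self.B.size


# Usage: adapt a 32 -> 16 layer with rank 4 (sizes are illustrative only).
W = np.random.default_rng(1).normal(size=(16, 32))
layer = LoRALinear(W, rank=4)
x = np.ones((2, 32))
y = layer(x)
```

Under these assumed sizes, only the two small factors would be trained while the 16 x 32 pre-trained weight stays fixed, which is what makes this kind of adaptation cheap.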