Text-guided content generation has recently received extensive attention. In this work, we explore text description-based speaker generation, i.e., using text prompts to control the speaker generation process. Specifically, we propose PromptSpeaker, a text-guided speaker generation system. PromptSpeaker consists of a prompt encoder, a zero-shot VITS, and a Glow model. The prompt encoder predicts a prior distribution from the text description and samples from this distribution to obtain a semantic representation. The Glow model then converts the semantic representation into a speaker representation, and the zero-shot VITS finally synthesizes the speaker's voice from the speaker representation. Using objective metrics, we verify that PromptSpeaker can generate new speakers that do not appear in the training set, and subjective evaluation shows that the synthesized voice matches the speaker prompt reasonably well.
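The data flow described above (prompt encoder, sampling, flow, speaker representation) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the authors' implementation: the names `prompt_encoder`, `sample_semantic`, `AffineFlow`, and `generate_speaker` are hypothetical, the prior is assumed to be a diagonal Gaussian, a single elementwise affine layer stands in for Glow, and the zero-shot VITS synthesis step is omitted.

```python
import math
import random


def prompt_encoder(prompt, dim=4):
    """Hypothetical stand-in for the prompt encoder.

    Returns the mean and log standard deviation of an assumed diagonal
    Gaussian prior. A real system would use a learned text encoder; here the
    parameters are derived deterministically from the prompt's characters.
    """
    rng = random.Random(sum(ord(c) for c in prompt))
    mean = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    log_std = [rng.uniform(-2.0, 0.0) for _ in range(dim)]
    return mean, log_std


def sample_semantic(mean, log_std, rng):
    """Draw a semantic representation z = mean + std * eps, eps ~ N(0, 1)."""
    return [m + math.exp(s) * rng.gauss(0.0, 1.0) for m, s in zip(mean, log_std)]


class AffineFlow:
    """One elementwise affine layer, an invertible toy stand-in for Glow."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        self.log_scale = [rng.uniform(-0.5, 0.5) for _ in range(dim)]
        self.shift = [rng.uniform(-1.0, 1.0) for _ in range(dim)]

    def forward(self, z):
        # Semantic representation -> speaker representation.
        return [math.exp(a) * x + b
                for x, a, b in zip(z, self.log_scale, self.shift)]

    def inverse(self, y):
        # Speaker representation -> semantic representation.
        return [(v - b) * math.exp(-a)
                for v, a, b in zip(y, self.log_scale, self.shift)]


def generate_speaker(prompt, flow, rng):
    """Text prompt -> sampled semantic representation -> speaker representation."""
    mean, log_std = prompt_encoder(prompt, dim=len(flow.shift))
    z = sample_semantic(mean, log_std, rng)
    return flow.forward(z)
```

Because the semantic representation is sampled rather than computed deterministically, calling `generate_speaker` twice with the same prompt yields two different speaker representations in this sketch, which mirrors how sampling from a prompt-conditioned prior can produce multiple new speakers for a single description.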