Style transfer in text-to-speech (TTS) has shown impressive performance in recent years. However, style control is often restricted to systems built on expressive speech recordings with discrete style categories. In practice, users may want to transfer style by typing a text description of the desired style, without providing reference speech in the target style. Text-guided content generation techniques have recently drawn wide attention. In this work, we explore the possibility of controllable style transfer with natural language descriptions. To this end, we propose PromptStyle, a text prompt-guided cross-speaker style transfer system. Specifically, PromptStyle consists of an improved VITS and a cross-modal style encoder. The cross-modal style encoder constructs a shared space of stylistic and semantic representations through a two-stage training process. Experiments show that PromptStyle can achieve proper style transfer with text prompts while maintaining relatively high stability and speaker similarity. Audio samples are available on our demo page.