Recent singing-voice-synthesis (SVS) methods have achieved remarkable audio quality and naturalness, yet they cannot explicitly control the style attributes of the synthesized singing. We propose Prompt-Singer, the first SVS method that enables control over singer gender, vocal range, and volume through natural language. We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation that enables text-conditioned vocal range control while preserving melodic accuracy. Furthermore, to facilitate further research, we explore various experimental settings, including different types of text representations, text encoder fine-tuning, and the introduction of speech data to alleviate data scarcity. Experiments show that our model achieves favorable control ability and audio quality. Audio samples are available at http://prompt-singer.github.io.
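The abstract does not specify how the range-melody decoupled pitch representation is computed. As a minimal sketch of one plausible reading, the example below assumes that vocal range is summarized by the mean F0 of the voiced frames and that melody is the F0 contour rescaled to a fixed reference mean. The function names `decouple_pitch` and `recouple_pitch`, the 220 Hz reference, and the convention that unvoiced frames are marked with 0 are all illustrative assumptions, not details taken from the paper.

```python
def decouple_pitch(f0_hz, ref_mean_hz=220.0):
    """Split an F0 contour into (range, melody).

    Assumption: range = mean F0 over voiced frames (f0 > 0), and
    melody = the contour rescaled so its voiced mean equals ref_mean_hz.
    """
    voiced = [f for f in f0_hz if f > 0]
    if not voiced:
        # No voiced frames: nothing to normalize.
        return 0.0, list(f0_hz)
    range_hz = sum(voiced) / len(voiced)
    scale = ref_mean_hz / range_hz
    # Scaling by a ratio preserves the intervals between notes.
    melody = [f * scale if f > 0 else 0.0 for f in f0_hz]
    return range_hz, melody


def recouple_pitch(range_hz, melody, ref_mean_hz=220.0):
    """Rebuild an F0 contour from a (possibly text-chosen) range and a melody."""
    scale = range_hz / ref_mean_hz
    return [m * scale for m in melody]


# The same tune sung an octave apart: the ranges differ, the melodies match.
low_range, low_melody = decouple_pitch([110.0, 0.0, 165.0, 220.0])
high_range, high_melody = decouple_pitch([220.0, 0.0, 330.0, 440.0])
```

Under these assumptions, the melody component is independent of the register in which it was sung, so a range value chosen from a text prompt can be recombined with it through `recouple_pitch` without altering the intervals between notes.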