Recent singing-voice-synthesis (SVS) methods have achieved remarkable audio quality and naturalness, yet they cannot explicitly control the style attributes of the synthesized singing. We propose Prompt-Singer, the first SVS method that enables control over singer gender, vocal range, and volume through natural language. We adopt a decoder-only transformer architecture with a multi-scale hierarchy, and design a range-melody decoupled pitch representation that enables text-conditioned vocal range control while preserving melodic accuracy. Furthermore, we explore various experimental settings, including different types of text representations, text encoder fine-tuning, and the introduction of speech data to alleviate data scarcity, aiming to facilitate further research. Experiments show that our model achieves favorable control ability and audio quality. Audio samples are available at http://prompt-singer.github.io.
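To make the idea of a range-melody decoupled pitch representation concrete, the following is a minimal illustrative sketch and not the formulation used by Prompt-Singer. It assumes that vocal range can be summarized by the geometric mean of the voiced F0 values and that melody can be expressed as semitone offsets from that mean; the function names `decouple_pitch`, `shift_range`, and `recombine` are hypothetical.

```python
import math


def decouple_pitch(f0_hz):
    """Split an F0 contour (Hz, 0 = unvoiced) into a range value and a melody contour.

    Assumption for illustration only: range is the geometric mean of voiced F0,
    and melody is the per-frame offset from that mean in semitones.
    """
    voiced = [f for f in f0_hz if f > 0]
    if not voiced:
        raise ValueError("contour has no voiced frames")
    # Averaging in the log2 domain gives the geometric mean in Hz.
    mean_log2 = sum(math.log2(f) for f in voiced) / len(voiced)
    range_hz = 2.0 ** mean_log2
    # Unvoiced frames are kept as None so frame alignment is preserved.
    melody = [12.0 * (math.log2(f) - mean_log2) if f > 0 else None for f in f0_hz]
    return range_hz, melody


def shift_range(range_hz, semitones):
    """Move the range value up or down without touching the melody contour."""
    return range_hz * 2.0 ** (semitones / 12.0)


def recombine(range_hz, melody):
    """Rebuild an F0 contour in Hz from a range value and semitone offsets."""
    return [range_hz * 2.0 ** (m / 12.0) if m is not None else 0.0 for m in melody]


# Usage: raise the vocal range by one octave while keeping the same intervals.
contour = [220.0, 0.0, 440.0]
range_hz, melody = decouple_pitch(contour)
shifted = recombine(shift_range(range_hz, 12.0), melody)
```

Because the melody component stores only relative intervals, changing the range value transposes the whole contour while the intervals between notes stay the same, which is the property the abstract describes as controlling vocal range while preserving melodic accuracy.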