Although speech synthesis has progressed rapidly in recent years, open-source singing voice synthesis (SVS) systems still face significant barriers to industrial deployment, particularly in robustness and zero-shot generalization. In this report, we introduce SoulX-Singer, a high-quality open-source SVS system designed for practical deployment. SoulX-Singer supports controllable singing generation conditioned on either symbolic musical scores (MIDI) or melodic representations, enabling flexible and expressive control in real-world production workflows. Trained on more than 42,000 hours of vocal data, the system supports Mandarin Chinese, English, and Cantonese and consistently achieves state-of-the-art synthesis quality across languages under diverse musical conditions. Furthermore, to enable reliable and systematic evaluation of zero-shot SVS performance in practical scenarios, we construct SoulX-Singer-Eval, a dedicated benchmark with strict training-test disentanglement.