This paper introduces a robust singing voice synthesis (SVS) system that efficiently produces highly natural and realistic singing voices by leveraging adversarial training. On the one hand, we design simple but generic random area conditional discriminators to help supervise the acoustic model, which effectively avoids over-smoothed spectrogram prediction and improves the expressiveness of SVS. On the other hand, we combine the spectrogram with the frame-level linearly-interpolated F0 sequence as the input to the neural vocoder, which is then optimized with multiple adversarial conditional discriminators in the waveform domain and multi-scale distance functions in the frequency domain. Experimental results and ablation studies show that, compared with our previous auto-regressive work, the new system can produce high-quality singing voices efficiently after fine-tuning on different singing datasets ranging from several minutes to a few hours in length. A large number of synthesized songs with different timbres are available online at https://zzw922cn.github.io/wesinger2, and we highly recommend that readers listen to them.
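To make the vocoder input described in the abstract more concrete, the following is a minimal sketch of how a spectrogram might be combined with a frame-level linearly-interpolated F0 sequence. The abstract does not give implementation details, so everything here is an assumption for illustration: the function names `interpolate_f0` and `build_vocoder_input` are hypothetical, unvoiced frames are assumed to be marked with F0 = 0, interpolation is assumed to fill those unvoiced gaps, and the combination is assumed to be a simple channel-wise concatenation. The actual system may differ on each point.

```python
import numpy as np


def interpolate_f0(f0):
    """Fill unvoiced frames (assumed to be marked as 0) by linear interpolation.

    Hypothetical helper: the unvoiced-as-zero convention is an assumption.
    """
    f0 = np.asarray(f0, dtype=np.float64)
    voiced = f0 > 0
    if not voiced.any():
        # Nothing to interpolate from; return the sequence unchanged.
        return f0.copy()
    frames = np.arange(len(f0))
    # np.interp interpolates linearly between voiced frames and holds the
    # first/last voiced value constant before/after the voiced region.
    return np.interp(frames, frames[voiced], f0[voiced])


def build_vocoder_input(mel, f0):
    """Stack a spectrogram (n_bins, T) with interpolated F0 (T,) into (n_bins + 1, T).

    Hypothetical helper: channel-wise concatenation is an assumption.
    """
    mel = np.asarray(mel, dtype=np.float64)
    f0_interp = interpolate_f0(f0)
    if mel.shape[1] != f0_interp.shape[0]:
        raise ValueError("spectrogram and F0 must have the same number of frames")
    return np.concatenate([mel, f0_interp[None, :]], axis=0)


if __name__ == "__main__":
    f0 = [0, 100, 0, 0, 130, 0]   # toy F0 contour in Hz, 0 = unvoiced
    mel = np.zeros((80, 6))       # placeholder 80-bin spectrogram, 6 frames
    features = build_vocoder_input(mel, f0)
    print(features.shape)
    print(features[-1])
```

In this sketch the last row of the returned array carries a continuous pitch contour, so a vocoder conditioned on it would see explicit pitch information on every frame, including frames where the raw F0 tracker reported no voicing.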