There has been growing interest in using end-to-end acoustic models for singing voice synthesis (SVS). Typically, these models require an additional vocoder to transform the generated acoustic features into the final waveform. However, because the acoustic model and the vocoder are not jointly optimized, a gap can exist between the two models, leading to suboptimal performance. Although a similar problem has been addressed in text-to-speech (TTS) systems by joint training or by replacing acoustic features with a latent representation, adapting these approaches to SVS is not straightforward. How to improve the joint training of SVS systems has not been well explored. In this paper, we conduct a systematic investigation of how to better perform joint training of an acoustic model and a vocoder for SVS. We carry out extensive experiments and demonstrate that our joint-training strategy outperforms baselines, achieving more stable performance across different datasets while also increasing the interpretability of the entire framework.