In this paper, we propose Karaoker-SSL, a singing voice synthesis model that is trained only on text and speech data, like a typical multi-speaker acoustic model. It is a low-resource pipeline that uses no singing data at any stage, since its vocoder is also trained on speech data. Karaoker-SSL is conditioned on self-supervised speech representations in an unsupervised manner. We preprocess these representations by selecting only a subset of their task-correlated dimensions. During training, the conditioning module is indirectly guided to capture style information through multi-tasking: a Conformer-based module predicts the pitch from the acoustic model's output. Thus, Karaoker-SSL enables singing voice synthesis without relying on hand-crafted, domain-specific features. It also requires no text alignments or lyrics timestamps. To refine the voice quality, we employ a U-Net discriminator that is conditioned on the target speaker and follows a Diffusion GAN training scheme.
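The abstract does not say how the task-correlated dimensions are chosen. One possible reading, sketched here under stated assumptions, is to rank each feature dimension by the absolute Pearson correlation between its per-frame values and a per-frame target signal, then keep the top k. The choice of pitch as the target, the value of k, and all function names (`pearson`, `select_correlated_dims`, `keep_dims`, `make_toy_frames`) are hypothetical and are not taken from the paper.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_correlated_dims(frames, target, k):
    """Return the sorted indices of the k dimensions most correlated with target.

    frames: list of per-frame feature vectors (each a list of floats).
    target: list with one value per frame (assumed here to be pitch).
    """
    num_dims = len(frames[0])
    scores = [
        abs(pearson([frame[d] for frame in frames], target))
        for d in range(num_dims)
    ]
    ranked = sorted(range(num_dims), key=lambda d: scores[d], reverse=True)
    return sorted(ranked[:k])


def keep_dims(frames, dims):
    """Project every frame onto the selected dimensions."""
    return [[frame[d] for d in dims] for frame in frames]


def make_toy_frames(num_frames=200, num_dims=8, seed=0):
    """Toy data: dimensions 2 and 5 depend on the target, the rest are noise."""
    rng = random.Random(seed)
    target = [rng.uniform(80.0, 400.0) for _ in range(num_frames)]
    frames = []
    for value in target:
        frame = [rng.gauss(0.0, 1.0) for _ in range(num_dims)]
        frame[2] = value / 100.0                       # perfectly correlated
        frame[5] = -0.5 * value + rng.gauss(0.0, 1.0)  # strongly anti-correlated
        frames.append(frame)
    return frames, target


frames, target = make_toy_frames()
selected = select_correlated_dims(frames, target, k=2)
reduced = keep_dims(frames, selected)
```

Ranking by absolute correlation keeps dimensions that vary inversely with the target as well as those that vary with it. A real self-supervised model would output hundreds of dimensions per frame, and the selection would be computed once over a speech corpus rather than on toy data.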