While most speech synthesis research has focused on synthesizing high-quality speech for in-dataset speakers, an equally essential yet unsolved problem is synthesizing speech for out-of-dataset (unseen) speakers from limited reference data, i.e., speaker adaptive speech synthesis. Many studies have proposed zero-shot speaker adaptive text-to-speech and voice conversion approaches for this task. However, most current approaches suffer degraded naturalness and speaker similarity when synthesizing speech for unseen speakers, because the models generalize poorly to out-of-distribution data. To address this problem, we propose GZS-TV, a generalizable zero-shot speaker adaptive text-to-speech and voice conversion model. GZS-TV introduces disentangled representation learning for both speaker embedding extraction and timbre transformation to improve model generalization, and leverages the representation learning capability of the variational autoencoder to enhance the speaker encoder. Our experiments demonstrate that GZS-TV reduces performance degradation on unseen speakers and outperforms all baseline models on multiple datasets.