Recent research in zero-shot speech synthesis has made significant progress in speaker similarity. However, current efforts focus on timbre generalization rather than prosody modeling, which limits the naturalness and expressiveness of the synthesized speech. To address this, we introduce a novel speech synthesis model, trained on large-scale datasets, that incorporates both timbre modeling and hierarchical prosody modeling. As timbre is a global attribute closely linked to expressiveness, we adopt a global vector to model speaker timbre and to guide prosody modeling. In addition, given that prosody exhibits both global consistency and local variations, we introduce a diffusion model as the pitch predictor and employ a prosody adaptor to model prosody hierarchically, further enhancing the prosody quality of the synthesized speech. Experimental results show that our model not only maintains timbre quality comparable to the baseline but also exhibits better naturalness and expressiveness.