Human speech exhibits rich and flexible prosodic variation. To address the one-to-many mapping from text to prosody in a principled and flexible manner, we propose DiffStyleTTS, a multi-speaker acoustic model based on a conditional diffusion module and an improved classifier-free guidance scheme, which hierarchically models prosodic features of speech and controls different prosodic styles to guide prosody prediction. Experiments show that our method outperforms all baselines in naturalness and synthesizes faster than the three diffusion-based baselines. Additionally, by adjusting the guidance scale, DiffStyleTTS effectively controls the guidance intensity applied to the synthesized prosody.
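The guidance-scale control described above is characteristic of classifier-free guidance, where the model's conditional and unconditional noise predictions are linearly combined at sampling time. The sketch below shows the standard formulation only; the paper's improved variant may differ, and the function and variable names here are illustrative, not taken from the paper.

```python
import numpy as np

def cfg_combine(eps_uncond: np.ndarray, eps_cond: np.ndarray, w: float) -> np.ndarray:
    """Standard classifier-free guidance (illustrative, not the paper's exact variant).

    Linearly extrapolates from the unconditional noise prediction toward the
    conditional one:
        w = 0 -> purely unconditional (prosody condition ignored)
        w = 1 -> purely conditional
        w > 1 -> conditioning amplified (stronger prosodic style guidance)
    """
    return eps_uncond + w * (eps_cond - eps_uncond)

# Toy noise predictions for a single diffusion step
eps_u = np.array([0.2, -0.1])
eps_c = np.array([0.5, 0.3])
print(cfg_combine(eps_u, eps_c, 2.0))
```

Raising `w` pushes the sample further along the direction implied by the prosodic condition, which matches the abstract's claim that adjusting the guidance scale controls the guidance intensity of the synthesized prosody.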