Text-to-Speech (TTS) models have advanced significantly, aiming to accurately replicate the diversity of human speech, including unique speaker identities and linguistic nuances. Despite these advancements, achieving an optimal balance between speaker fidelity and text intelligibility remains a challenge, particularly when diverse control demands are considered. Addressing this, we introduce DualSpeech, a TTS model that integrates phoneme-level latent diffusion with dual classifier-free guidance. This approach enables fine-grained control over speaker fidelity and text intelligibility. Experimental results demonstrate that, by leveraging this control, DualSpeech surpasses existing state-of-the-art TTS models in performance. Demos are available at https://bit.ly/48Ewoib.
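The abstract does not spell out the dual guidance rule. A common form of multi-condition classifier-free guidance, which the "dual" guidance described here plausibly resembles, applies two independent guidance weights to the speaker-conditioned and text-conditioned noise estimates; the function name and weight symbols below are illustrative, not taken from the paper:

```python
import numpy as np

def dual_cfg(eps_uncond, eps_spk, eps_txt, w_spk, w_txt):
    """Sketch of dual classifier-free guidance (assumed form).

    The unconditional noise estimate is shifted independently toward
    the speaker-conditioned and the text-conditioned estimates, so the
    two weights trade off speaker fidelity against text intelligibility.
    """
    return (eps_uncond
            + w_spk * (eps_spk - eps_uncond)
            + w_txt * (eps_txt - eps_uncond))

# With both weights at 0 the sampler falls back to the unconditional
# estimate; setting one weight to 1 and the other to 0 recovers the
# corresponding single-condition estimate.
```

Raising `w_txt` relative to `w_spk` would then favor intelligibility, and vice versa, matching the speaker-fidelity/text-intelligibility trade-off the abstract highlights.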