Text-to-Speech (TTS) has recently seen great progress in synthesizing high-quality speech owing to the rapid development of parallel TTS systems, but producing speech with naturalistic prosodic variations, speaking styles, and emotional tones remains challenging. Moreover, since duration and speech are generated separately, parallel TTS models still have difficulty finding the best monotonic alignments, which are crucial for naturalistic speech synthesis. Here, we propose StyleTTS, a style-based generative model for parallel TTS that can synthesize diverse speech with natural prosody from a reference speech utterance. With a novel Transferable Monotonic Aligner (TMA) and duration-invariant data augmentation schemes, our method significantly outperforms state-of-the-art models on both single- and multi-speaker datasets in subjective tests of speech naturalness and speaker similarity. Through self-supervised learning of speaking styles, our model can synthesize speech with the same prosodic and emotional tone as any given reference speech without the need to label these categories explicitly.