Although text-to-speech (TTS) systems have improved significantly, most still struggle to synthesize speech with appropriate phrasing. Natural speech synthesis requires a phrasing structure that groups words into phrases based on semantic information. In this paper, we propose PauseSpeech, a speech synthesis system that combines a pre-trained language model with pause-based prosody modeling. First, we introduce a phrasing structure encoder that uses a context representation from the pre-trained language model. Within this encoder, we extract a speaker-dependent syntactic representation from the context representation and then predict a pause sequence that separates the input text into phrases. Second, we introduce a pause-based word encoder that models word-level prosody based on the pause sequence. Experimental results show that PauseSpeech outperforms previous models in terms of naturalness. In objective evaluations, the proposed methods also reduce the distance between ground-truth and synthesized speech. Audio samples are available at https://jisang93.github.io/pausespeech-demo/.
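To make the idea of a pause sequence separating text into phrases concrete, the following is a minimal illustrative sketch, not the PauseSpeech implementation. It assumes the pause sequence is a list of binary labels, one per word, where 1 means a pause follows that word; the abstract does not specify the actual label format, and the function name `split_into_phrases` is hypothetical.

```python
from typing import List


def split_into_phrases(words: List[str], pauses: List[int]) -> List[List[str]]:
    """Group words into phrases using a per-word pause label.

    Assumption (not from the paper): pauses[i] == 1 means a pause
    follows words[i]; any other value means no pause.
    """
    if len(words) != len(pauses):
        raise ValueError("words and pauses must have the same length")

    phrases: List[List[str]] = []
    current: List[str] = []
    for word, pause in zip(words, pauses):
        current.append(word)
        if pause == 1:
            # A pause closes the current phrase.
            phrases.append(current)
            current = []
    if current:
        # Words after the last pause form the final phrase.
        phrases.append(current)
    return phrases


if __name__ == "__main__":
    words = ["after", "the", "meeting", "we", "went", "home"]
    pauses = [0, 0, 1, 0, 0, 0]
    print(split_into_phrases(words, pauses))
```

In a system like the one described, the pause labels would be predicted by a model rather than supplied by hand; here they are hard-coded purely for illustration.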