For text-to-speech (TTS) synthesis, prosodic structure prediction (PSP) plays an important role in producing natural and intelligible speech. Although inter-utterance linguistic information can influence how the target utterance is interpreted when spoken, previous work on PSP mainly uses only the intra-utterance linguistic information of the current utterance. This work proposes using inter-utterance linguistic information to improve the performance of PSP. Multi-level contextual information, which includes both inter-utterance and intra-utterance linguistic information, is extracted by a hierarchical encoder at the character level, utterance level and discourse level of the input text. A multi-task learning (MTL) decoder then predicts prosodic boundaries from this multi-level contextual information. Objective evaluation results on two datasets show that our method achieves better F1 scores in predicting prosodic word (PW), prosodic phrase (PPH) and intonational phrase (IPH) boundaries. These results demonstrate the effectiveness of using multi-level contextual information for PSP. Subjective preference tests also indicate that the naturalness of the synthesized speech is improved.
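As an illustration of the objective metric mentioned above, a per-level boundary F1 score can be sketched as follows. This is a minimal sketch, not the authors' evaluation code: the label encoding (one binary flag per character and per level, where 1 marks a boundary after that character), the helper names `boundary_f1` and `per_level_f1`, and the toy sequences are all assumptions made for illustration.

```python
from typing import Dict, List

# The three prosodic boundary levels named in the abstract.
LEVELS = ("PW", "PPH", "IPH")


def boundary_f1(gold: List[int], pred: List[int]) -> float:
    """F1 for one binary boundary sequence (assumed: 1 = boundary after the character)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        # No correctly predicted boundary: define F1 as 0 to avoid division by zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def per_level_f1(gold: Dict[str, List[int]], pred: Dict[str, List[int]]) -> Dict[str, float]:
    """Score each boundary level independently, one binary task per level."""
    return {level: boundary_f1(gold[level], pred[level]) for level in LEVELS}


# Toy example: a six-character utterance with hypothetical labels.
gold = {
    "PW":  [0, 1, 0, 1, 0, 1],
    "PPH": [0, 0, 0, 1, 0, 1],
    "IPH": [0, 0, 0, 0, 0, 1],
}
pred = {
    "PW":  [0, 1, 0, 1, 0, 1],
    "PPH": [0, 1, 0, 1, 0, 1],  # one spurious PPH boundary
    "IPH": [0, 0, 0, 0, 0, 0],  # the IPH boundary is missed
}
scores = per_level_f1(gold, pred)
print(scores)
```

Scoring each level as its own binary task mirrors the multi-task framing of the decoder, but the exact label scheme and averaging used in the paper are not stated in the abstract.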