Conversational text-to-speech (TTS) aims to synthesize a reply whose prosody suits the preceding conversation. Modeling the conversation comprehensively remains challenging, however: most conversational TTS systems extract only global information and omit local prosody features, which carry important fine-grained information such as keywords and emphasis. Moreover, textual features alone are insufficient, because acoustic features also carry rich prosodic information. We therefore propose M2-CTTS, an end-to-end multi-scale multi-modal conversational text-to-speech system that aims to make comprehensive use of the historical conversation and enhance prosodic expression. Specifically, we design a textual context module and an acoustic context module, each with both coarse-grained and fine-grained modeling. Experimental results show that our model, which incorporates fine-grained context information and additionally considers acoustic features, achieves better prosody and naturalness in comparative mean opinion score (CMOS) tests.
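To make the coarse-grained versus fine-grained distinction concrete, the following is a minimal sketch and not the M2-CTTS implementation. It assumes that history features (textual or acoustic) are already available as equal-length vectors, uses mean pooling as a stand-in for coarse-grained (global) context, and uses scaled dot-product attention as a stand-in for fine-grained (local) context. All function names (`mean_pool`, `attend`, `context_features`) are hypothetical and chosen only for illustration.

```python
import math


def mean_pool(vectors):
    """Coarse-grained stand-in: average a sequence of vectors into one global vector."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys):
    """Fine-grained stand-in: scaled dot-product attention of one query over history vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(keys[0])
    return [sum(w * key[i] for w, key in zip(weights, keys)) for i in range(dim)]


def context_features(history_text, history_audio, current_tokens):
    """Combine both scales and both modalities for each token of the current utterance.

    Assumption: all vectors share one dimensionality; a real system would use
    learned encoders and projections instead of raw concatenation.
    """
    coarse_text = mean_pool(history_text)    # global textual context
    coarse_audio = mean_pool(history_audio)  # global acoustic context
    combined = []
    for tok in current_tokens:
        fine_text = attend(tok, history_text)    # token-specific textual context
        fine_audio = attend(tok, history_audio)  # token-specific acoustic context
        combined.append(tok + coarse_text + coarse_audio + fine_text + fine_audio)
    return combined
```

In this sketch the coarse-grained vectors are identical for every token of the current utterance, while the fine-grained vectors vary per token. That difference is what allows local cues such as keywords and emphasis to influence individual positions in the reply.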