While current TTS systems perform well in synthesizing high-quality speech, producing highly expressive speech remains a challenge. Emphasis, a critical factor in determining the expressiveness of speech, has attracted increasing attention. Previous works usually enhance emphasis by adding intermediate features, but they cannot guarantee the overall expressiveness of the speech. To address this, we propose Emphatic Expressive TTS (EE-TTS), which leverages multi-level linguistic information from syntax and semantics. EE-TTS contains an emphasis predictor that identifies appropriate emphasis positions from text, and a conditioned acoustic model that synthesizes expressive speech with emphasis and linguistic information. Experimental results indicate that EE-TTS outperforms the baseline, with MOS improvements of 0.49 and 0.67 in expressiveness and naturalness, respectively. According to AB test results, EE-TTS also shows strong generalization across different datasets.