While current TTS systems perform well in synthesizing high-quality speech, producing highly expressive speech remains a challenge. Emphasis, a critical factor in determining the expressiveness of speech, has attracted increasing attention. Previous works usually enhance emphasis by adding intermediate features, but they cannot guarantee the overall expressiveness of the speech. To address this, we propose Emphatic Expressive TTS (EE-TTS), which leverages multi-level linguistic information from syntax and semantics. EE-TTS contains an emphasis predictor that identifies appropriate emphasis positions from text and a conditioned acoustic model that synthesizes expressive speech with emphasis and linguistic information. Experimental results indicate that EE-TTS outperforms the baseline with MOS improvements of 0.49 and 0.67 in expressiveness and naturalness. EE-TTS also shows strong generalization across different datasets according to AB test results.
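The two-stage structure described above — an emphasis predictor that marks emphasis positions in the text, followed by an acoustic model conditioned on those positions — can be illustrated with a minimal sketch. All names below are illustrative placeholders, and the keyword-based predictor and string-based "synthesizer" are toy stand-ins for the paper's learned models, not the actual EE-TTS implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class EmphasisAnnotation:
    """Text tokens with one emphasis flag per token."""
    tokens: List[str]
    emphasized: List[bool]

def predict_emphasis(text: str) -> EmphasisAnnotation:
    """Toy stand-in for the learned emphasis predictor:
    emphasize tokens written in ALL CAPS."""
    tokens = text.split()
    flags = [t.isupper() and len(t) > 1 for t in tokens]
    return EmphasisAnnotation(tokens, flags)

def synthesize(annotation: EmphasisAnnotation) -> str:
    """Toy stand-in for the conditioned acoustic model: marks
    emphasized tokens in a string instead of producing audio."""
    return " ".join(
        f"*{t}*" if e else t
        for t, e in zip(annotation.tokens, annotation.emphasized)
    )

ann = predict_emphasis("this is REALLY important")
print(synthesize(ann))  # → this is *REALLY* important
```

The key design point the sketch mirrors is that emphasis positions are predicted from text alone and then passed as conditioning input to the second stage, rather than being injected as intermediate acoustic features.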