State-of-the-art Text-To-Speech (TTS) models are capable of producing high-quality speech. The generated speech, however, is usually neutral in emotional expression, whereas one often wants fine-grained emotional control at the word or phoneme level. Although this remains challenging, the first TTS models that control the voice through manually assigned emotion intensity have recently been proposed. Unfortunately, because they neglect intra-class distance, the intensity differences are often unrecognizable. In this paper, we propose a fine-grained controllable emotional TTS model that considers both inter- and intra-class distances and is able to synthesize speech with recognizable intensity differences. Our subjective and objective experiments demonstrate that our model outperforms two state-of-the-art controllable TTS models in controllability, emotion expressiveness, and naturalness.