State-of-the-art text-to-speech (TTS) models are capable of producing high-quality speech. The generated speech, however, is usually neutral in emotional expression, whereas one often wants fine-grained emotional control at the word or phoneme level. Although this remains challenging, the first TTS models that control the voice through manually assigned emotion intensity have recently been proposed. Unfortunately, because these models neglect intra-class distance, the intensity differences are often unrecognizable. In this paper, we propose a fine-grained controllable emotional TTS model that considers both inter- and intra-class distances and is able to synthesize speech with recognizable intensity differences. Our subjective and objective experiments demonstrate that our model outperforms two state-of-the-art controllable TTS models in controllability, emotion expressiveness, and naturalness.