Although current text-to-speech (TTS) models can generate high-quality speech, controlling emotion intensity in TTS remains challenging. Most existing TTS models control emotion intensity by extracting intensity information from reference speech. However, because they do not model intra-class emotion intensity and have limited ability to decouple information, the generated speech lacks fine-grained emotion intensity control and suffers from information leakage. In this paper, we propose an emotion transfer TTS model that defines a remapping-based sorting method to model intra-class relative intensity information and uses mutual information (MI) to decouple speaker and emotion information, synthesizing expressive speech with perceptible intensity differences. Experiments show that our model achieves fine-grained emotion control while preserving speaker information.