While current emotional text-to-speech (TTS) systems can generate highly intelligible emotional speech, achieving fine-grained control over the emotion rendering of the output speech remains a significant challenge. In this paper, we introduce ParaEVITS, a novel emotional TTS framework that leverages the compositionality of natural language to enhance control over emotional rendering. By incorporating a text-audio encoder inspired by ParaCLAP, a contrastive language-audio pretraining (CLAP) model for computational paralinguistics, a diffusion model is trained to generate emotional embeddings based on textual emotional style descriptions. Our framework is first trained on reference audio using the audio encoder, and a diffusion model is then fine-tuned to process textual inputs from ParaCLAP's text encoder. During inference, speech attributes such as pitch, jitter, and loudness are manipulated using only textual conditioning. Our experiments demonstrate that ParaEVITS effectively controls emotion rendering without compromising speech quality. Speech demos are publicly available.
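The inference flow described above (a CLAP-style text encoder maps a style description to a conditioning vector, and a diffusion model denoises random noise into an emotion embedding under that condition) can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the encoder, the denoising rule, and all names and dimensions are hypothetical stand-ins.

```python
import numpy as np

EMB_DIM = 8  # toy embedding size (illustrative assumption)


def text_encoder(description: str) -> np.ndarray:
    """Stand-in for ParaCLAP's text encoder: a deterministic toy
    mapping from a style description to a conditioning vector."""
    rng = np.random.default_rng(abs(hash(description)) % (2**32))
    return rng.standard_normal(EMB_DIM)


def denoise_step(x: np.ndarray, cond: np.ndarray, t: int, steps: int) -> np.ndarray:
    """Toy conditional denoising step: interpolate the noisy sample
    toward the conditioning vector as t increases."""
    alpha = (t + 1) / steps  # grows from 1/steps to 1.0
    return (1 - alpha) * x + alpha * cond


def generate_emotion_embedding(description: str, steps: int = 10) -> np.ndarray:
    """Generate an emotion embedding from a textual style description."""
    cond = text_encoder(description)
    rng = np.random.default_rng(0)
    x = rng.standard_normal(EMB_DIM)  # start from Gaussian noise
    for t in range(steps):
        x = denoise_step(x, cond, t, steps)
    return x


emb = generate_emotion_embedding("a sad, low-pitched voice")
```

In the actual system the resulting embedding would condition the TTS acoustic model, steering attributes such as pitch, jitter, and loudness; here the final sample simply converges to the toy conditioning vector.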