Existing emotional speech synthesis methods often rely on a single utterance-level style embedding extracted from reference audio, neglecting the inherent multi-scale nature of speech prosody. We introduce ED-TTS, a multi-scale emotional speech synthesis model that leverages Speech Emotion Diarization (SED) and Speech Emotion Recognition (SER) to model emotions at different levels. Specifically, our approach integrates the utterance-level emotion embedding extracted by SER with the fine-grained frame-level emotion embedding obtained from SED. These embeddings condition the reverse process of a denoising diffusion probabilistic model (DDPM). Additionally, we employ cross-domain SED to accurately predict soft labels, addressing the scarcity of fine-grained emotion-annotated datasets for supervising emotional TTS training.
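The conditioning described above can be illustrated with a minimal NumPy sketch. It is not the ED-TTS implementation. The abstract does not state how the two embeddings are integrated, so concatenation after broadcasting the utterance-level vector over time is an assumption. The function names, the dimensions, and the linear `toy_eps_model` are hypothetical placeholders for the SER and SED encoders and the trained noise predictor. The reverse step uses the standard DDPM ancestral sampling update with variance equal to beta_t.

```python
import numpy as np


def fuse_emotion_conditions(utt_emb, frame_emb):
    """Tile the utterance-level embedding over time and concatenate it with
    the frame-level embeddings (assumed fusion; not specified in the abstract).

    utt_emb:   (d_u,)    one vector per utterance (SER-style)
    frame_emb: (T, d_f)  one vector per frame (SED-style)
    returns:   (T, d_u + d_f)
    """
    num_frames = frame_emb.shape[0]
    utt_tiled = np.tile(utt_emb[None, :], (num_frames, 1))
    return np.concatenate([utt_tiled, frame_emb], axis=-1)


def ddpm_reverse_step(x_t, t, cond, eps_model, betas, rng):
    """One conditioned reverse (denoising) step x_t -> x_{t-1}."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    # The noise predictor sees the noisy input, the step index, and the condition.
    eps_hat = eps_model(x_t, t, cond)
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
    if t == 0:
        return mean  # no noise is added at the final step
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)


# Toy setup: all sizes below are illustrative assumptions.
rng = np.random.default_rng(0)
T, n_mels, d_u, d_f = 120, 80, 16, 8
utt_emb = rng.standard_normal(d_u)         # stand-in for an SER embedding
frame_emb = rng.standard_normal((T, d_f))  # stand-in for SED frame embeddings
cond = fuse_emotion_conditions(utt_emb, frame_emb)

# Hypothetical linear noise predictor; a real model would be a trained network.
W_x = 0.01 * rng.standard_normal((n_mels, n_mels))
W_c = 0.01 * rng.standard_normal((d_u + d_f, n_mels))


def toy_eps_model(x_t, t, cond):
    return x_t @ W_x + cond @ W_c


betas = np.linspace(1e-4, 0.02, 50)
x = rng.standard_normal((T, n_mels))  # start from Gaussian noise
for t in reversed(range(len(betas))):
    x = ddpm_reverse_step(x, t, cond, toy_eps_model, betas, rng)

print(x.shape)
```

Concatenation keeps the global and local emotion cues in separate channels so the noise predictor can weight them independently. Addition or cross-attention would be equally plausible choices under the same description.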