Emotional Text-to-Speech (E-TTS) synthesis has gained significant attention in recent years due to its potential to enhance human-computer interaction. However, current E-TTS approaches often struggle to capture the complexity of human emotions, primarily relying on oversimplified emotional labels or single-modality inputs. To address these limitations, we propose the Multimodal Emotional Text-to-Speech System (MM-TTS), a unified framework that leverages emotional cues from multiple modalities to generate highly expressive and emotionally resonant speech. MM-TTS consists of two key components: (1) the Emotion Prompt Alignment Module (EP-Align), which employs contrastive learning to align emotional features across text, audio, and visual modalities, ensuring a coherent fusion of multimodal information; and (2) the Emotion Embedding-Induced TTS (EMI-TTS), which integrates the aligned emotional embeddings with state-of-the-art TTS models to synthesize speech that accurately reflects the intended emotions. Extensive evaluations across diverse datasets demonstrate the superior performance of MM-TTS compared to traditional E-TTS models. Objective metrics, including Word Error Rate (WER) and Character Error Rate (CER), show significant improvements on the ESD dataset, with MM-TTS achieving scores of 7.35% and 3.07%, respectively. Subjective assessments further validate that MM-TTS generates speech with emotional fidelity and naturalness comparable to human speech. Our code and pre-trained models are publicly available at https://anonymous.4open.science/r/MMTTS-D214
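To make the contrastive-alignment idea behind EP-Align concrete, the following is a minimal sketch of a symmetric InfoNCE objective that pulls matched cross-modal emotion embeddings (e.g., text and audio from the same utterance) together while pushing apart mismatched pairs within a batch. The embedding dimension, temperature, and toy data below are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
# Illustrative sketch only: a symmetric InfoNCE loss of the kind used for
# cross-modal contrastive alignment. Dimensions and temperature are assumed.
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere so dot products are cosines."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def info_nce(anchor, positive, temperature=0.07):
    """Symmetric InfoNCE: the i-th anchor's positive is the i-th row of
    `positive`; all other rows in the batch serve as negatives."""
    a = l2_normalize(anchor)
    p = l2_normalize(positive)
    logits = a @ p.T / temperature          # (B, B) cosine-similarity matrix
    idx = np.arange(len(a))                 # matched pairs sit on the diagonal

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)          # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[idx, idx].mean()               # diagonal = targets

    # Average over both directions (anchor->positive and positive->anchor).
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 8))                       # toy "text" embeddings
audio_emb = text_emb + 0.01 * rng.normal(size=(4, 8))    # nearly aligned pairs
random_emb = rng.normal(size=(4, 8))                     # unrelated embeddings
print(info_nce(text_emb, audio_emb), info_nce(text_emb, random_emb))
```

Well-aligned pairs yield a much smaller loss than unrelated embeddings, which is the training signal that drives the modalities into a shared emotion space.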