Realistic emotional voice conversion (EVC) aims to enhance the emotional diversity of converted audio, making synthesized voices more authentic and natural. To this end, we propose the Emotional Intensity-aware Network (EINet), which dynamically adjusts intonation and rhythm by incorporating controllable emotional intensity. To better capture nuances in emotional intensity, we go beyond mere distance measurements among acoustic features: an emotion evaluator is employed to precisely quantify the speaker's emotional state. An intensity mapper then produces intensity pseudo-labels that bridge the gap between emotional intensity modeling and run-time conversion. To ensure high speech quality while retaining controllability, an emotion renderer smoothly combines linguistic features with the manipulated emotional features at the frame level. Furthermore, a duration predictor adaptively predicts rhythm changes conditioned on the specified intensity value. Experimental results show EINet's superior performance in naturalness and diversity of emotional expression compared with state-of-the-art EVC methods.
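The two controllable mechanisms described above can be sketched in miniature: frame-level blending of linguistic and emotional features scaled by an intensity value, and duration (rhythm) adjustment conditioned on that same value. This is a minimal illustrative sketch only; the function names, the linear blending formula, and the duration-stretch rule are assumptions for exposition, not EINet's actual architecture.

```python
def render_frames(linguistic, emotion, intensity):
    """Toy frame-level emotion rendering: blend frame-aligned
    linguistic and emotion feature vectors, scaling the emotional
    contribution by a scalar intensity in [0, 1].
    (Hypothetical stand-in for the paper's emotion renderer.)"""
    assert len(linguistic) == len(emotion)  # inputs must be frame-aligned
    return [
        [l + intensity * e for l, e in zip(lf, ef)]
        for lf, ef in zip(linguistic, emotion)
    ]

def predict_durations(base_durations, intensity, stretch=0.3):
    """Toy intensity-conditioned rhythm change: higher intensity
    stretches per-phoneme durations (slower, more emphatic delivery)
    by up to the `stretch` fraction.
    (Hypothetical stand-in for the paper's duration predictor.)"""
    return [round(d * (1.0 + stretch * intensity), 3) for d in base_durations]
```

For example, `render_frames([[1.0, 2.0]], [[0.5, 0.5]], intensity=0.5)` yields `[[1.25, 2.25]]`, and raising the intensity passed to `predict_durations` proportionally lengthens every duration, mimicking a rhythm change under stronger emotion.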