Speech emotion conversion is the task of changing the emotion expressed in a spoken utterance to a target emotion while preserving its lexical content and speaker identity. Whereas most existing work in speech emotion conversion relies on acted datasets and parallel data samples, this work focuses on the more challenging in-the-wild setting and does not rely on parallel data. To this end, we propose EmoConv-Diff, a diffusion-based generative model for speech emotion conversion that is trained to reconstruct an input utterance while conditioning on its emotion. At inference, a target emotion embedding is used to convert the emotion of the input utterance to the given target emotion. Rather than performing emotion conversion on categorical representations, we represent emotions along a continuous arousal dimension, which also enables intensity control. We validate the proposed methodology on a large in-the-wild dataset, MSP-Podcast v1.10. Our results show that the proposed diffusion model can synthesize speech with a controllable target emotion. Crucially, the proposed approach shows improved performance at the extreme values of arousal, thereby addressing a common challenge in the speech emotion conversion literature.
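The difference between training-time and inference-time conditioning described above can be illustrated with a minimal sketch. It is not the EmoConv-Diff implementation: the function names, the sinusoidal encoding, the embedding size, and the assumed 1 to 7 arousal range are all hypothetical choices made for illustration only.

```python
import math


def arousal_embedding(arousal, dim=8, lo=1.0, hi=7.0):
    """Encode a continuous arousal score as a small sinusoidal vector.

    Hypothetical encoding: the range [lo, hi] and the sinusoidal form are
    assumptions for this sketch, not details taken from the paper.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    x = (arousal - lo) / (hi - lo)  # normalise the score to [0, 1]
    emb = []
    for k in range(dim // 2):
        freq = 2.0 ** k  # one sine/cosine pair per frequency
        emb.append(math.sin(math.pi * freq * x))
        emb.append(math.cos(math.pi * freq * x))
    return emb


def select_condition(source_arousal, target_arousal=None, dim=8):
    """Pick the emotion condition passed to the generative model.

    Training (no target given): condition on the utterance's own arousal,
    so the model learns to reconstruct its input.
    Inference (target given): condition on the desired arousal instead.
    """
    value = source_arousal if target_arousal is None else target_arousal
    return arousal_embedding(value, dim=dim)


# Training-style call: the condition matches the input utterance.
train_cond = select_condition(source_arousal=3.0)

# Inference-style call: same utterance, but a high target arousal.
infer_cond = select_condition(source_arousal=3.0, target_arousal=6.5)
```

Because the condition is a continuous value rather than a category label, intermediate targets (for example 4.0, 5.0, 6.0) yield distinct embeddings, which is one simple way intensity control could be exposed.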