Speech emotion conversion is the task of converting the emotion expressed in a spoken utterance to a target emotion while preserving its lexical content and speaker identity. Whereas most existing work on speech emotion conversion relies on acted datasets and parallel data samples, this work focuses on the more challenging in-the-wild scenario and does not rely on parallel data. To this end, we propose EmoConv-Diff, a diffusion-based generative model for speech emotion conversion that is trained to reconstruct an input utterance while conditioning on its emotion. At inference, a target emotion embedding is used to convert the emotion of the input utterance to the given target emotion. Rather than performing emotion conversion on categorical representations, we represent emotion along a continuous arousal dimension, which also enables intensity control. We validate the proposed method on MSP-Podcast v1.10, a large in-the-wild dataset. Our results show that the proposed diffusion model is capable of synthesizing speech with a controllable target emotion. Crucially, the proposed approach shows improved performance at the extreme values of arousal, thereby addressing a common challenge in the speech emotion conversion literature.
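The idea of training on reconstruction and swapping the emotion condition at inference can be illustrated with a toy sketch. This is not the EmoConv-Diff implementation: a linear least-squares decoder stands in for the diffusion model, the content representation is assumed to be given rather than extracted from audio, arousal is assumed to be scaled to [0, 1], and all names (`make_toy_data`, `fit_decoder`, `convert`) are hypothetical.

```python
import numpy as np


def make_toy_data(n, dim, rng):
    """Synthetic 'utterances': a content part plus an arousal-dependent offset."""
    content = rng.normal(size=(n, dim))            # stand-in for lexical/speaker information
    arousal = rng.uniform(0.0, 1.0, size=(n, 1))   # continuous arousal, assumed scaled to [0, 1]
    direction = np.linspace(-1.0, 1.0, dim)        # hypothetical effect of arousal on the features
    features = content + arousal * direction
    return content, arousal, features, direction


def fit_decoder(content, arousal, features):
    """Reconstruction training: fit features ~ [content, arousal, 1] @ weights.

    Each sample is conditioned on its OWN arousal, so no parallel data is needed.
    """
    inputs = np.hstack([content, arousal, np.ones_like(arousal)])
    weights, *_ = np.linalg.lstsq(inputs, features, rcond=None)
    return weights


def convert(weights, content, target_arousal):
    """Inference: keep the content, replace the arousal condition with the target value."""
    cond = np.full((content.shape[0], 1), float(target_arousal))
    inputs = np.hstack([content, cond, np.ones_like(cond)])
    return inputs @ weights


rng = np.random.default_rng(0)
content, arousal, features, direction = make_toy_data(n=200, dim=4, rng=rng)
weights = fit_decoder(content, arousal, features)

# Convert every toy utterance to a high-arousal target; the scalar acts as an intensity control.
converted = convert(weights, content, target_arousal=0.9)
```

Because the condition is a continuous scalar, changing `target_arousal` changes the conversion intensity smoothly in this toy setting. Whether this carries over to a real diffusion model on speech is what the paper evaluates.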