Emotional voice conversion aims to manipulate speech according to a given emotion while preserving its non-emotional components. Existing approaches struggle to express fine-grained emotional attributes. In this paper, we propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion. We introduce a two-stage pipeline to train our network effectively. Stage I uses inter-speech contrastive learning to model fine-grained emotion and intra-speech disentanglement learning to better separate emotion from content. In Stage II, we regularize the conversion with a multi-view consistency mechanism, which helps transfer fine-grained emotion while maintaining speech content. Extensive experiments show that AINN outperforms state-of-the-art methods on both objective and subjective metrics.
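To make the idea of inter-speech contrastive learning concrete, the following is a minimal sketch of an InfoNCE-style loss over emotion embeddings. It is an illustration only, not the AINN training objective: the abstract does not specify the loss, so the choice of InfoNCE, the cosine similarity, the `temperature` value, and the toy two-dimensional embeddings are all assumptions, and the function names `cosine` and `info_nce` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for a single anchor embedding (assumed formulation).

    The loss is small when the anchor is closer to the positive than to
    every negative, and large otherwise.
    """
    # Index 0 holds the positive pair; the rest are negative pairs.
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all logits.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    # Negative log-softmax probability of the positive pair.
    return log_sum - logits[0]


# Toy emotion embeddings (hypothetical values, for illustration only).
anchor = [1.0, 0.0]
same_emotion = [0.9, 0.1]
other_emotions = [[0.0, 1.0], [-1.0, 0.0]]

# Correct pairing: the positive shares the anchor's emotion.
loss_good = info_nce(anchor, same_emotion, other_emotions)
# Wrong pairing: a different emotion is treated as the positive.
loss_bad = info_nce(anchor, other_emotions[0], [same_emotion, other_emotions[1]])
```

In this toy setup, pairing the anchor with an utterance of the same emotion gives a much lower loss than pairing it with a different emotion, which is the behavior a contrastive objective rewards.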