Emotional voice conversion (EVC) modifies the emotional tone of a speaker's voice while preserving the linguistic content and the speaker's vocal characteristics. Recent EVC work models pitch and duration jointly using sequence-to-sequence (seq2seq) models. To improve the reliability and efficiency of conversion, this study turns to parallel speech generation. We introduce Duration-Flexible EVC (DurFlex-EVC), which integrates a style autoencoder and a unit aligner. Previous models incorporate self-supervised learning (SSL) representations, which carry both linguistic and paralinguistic information, but they neglect this dual nature, which reduces controllability. To address this issue, we use cross-attention to synchronize these representations with various emotions. In addition, the style autoencoder disentangles style elements so that they can be manipulated. Subjective and objective evaluations validate the approach and show that it outperforms existing models.
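As a rough illustration of the cross-attention step mentioned in the abstract, the sketch below lets frame-level features attend over a small table of emotion-conditioned unit embeddings. It is a minimal sketch under stated assumptions, not the DurFlex-EVC implementation: the names `ssl_frames`, `unit_table`, `emotion_vec`, and `add_emotion`, the additive conditioning, and all dimensions are hypothetical, and random values stand in for real SSL features.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention.

    Each query row attends over all key rows; the output row is the
    attention-weighted sum of the value rows.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(w[j] * values[j][c] for j in range(len(values)))
             for c in range(len(values[0]))]
        )
    return outputs, weights


def add_emotion(units, emotion):
    """Assumption: condition each unit embedding by adding an emotion vector."""
    return [[u + e for u, e in zip(row, emotion)] for row in units]


# Hypothetical sizes: 5 frames, 3 units, 4-dimensional features.
rng = random.Random(0)
T, K, D = 5, 3, 4
ssl_frames = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(T)]   # stand-in for SSL features
unit_table = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(K)]   # stand-in for unit embeddings
emotion_vec = [rng.gauss(0, 1) for _ in range(D)]                      # stand-in for a target emotion

conditioned = add_emotion(unit_table, emotion_vec)
aligned, attn = cross_attention(ssl_frames, conditioned, conditioned)
```

Here `aligned` has one row per input frame, and each row of `attn` is a probability distribution over the units. In a trained model the queries, keys, and values would normally pass through learned projections, which this sketch omits.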