In this paper, we propose Prosody-aware VITS (PAVITS) for emotional voice conversion (EVC). PAVITS targets the two major objectives of EVC, high content naturalness and high emotional naturalness, both of which are crucial to human perception. To improve the content naturalness of the converted audio, we develop an end-to-end EVC architecture inspired by the high audio quality of VITS. By integrating the acoustic converter and the vocoder into a single model, we address the mismatch between emotional prosody training and run-time conversion that is common in existing EVC models. To further enhance emotional naturalness, we introduce an emotion descriptor that models the subtle prosody variations of different speech emotions. We additionally propose a prosody predictor, which predicts prosody features from text conditioned on the provided emotion label. Finally, we introduce a prosody alignment loss that connects the latent prosody features from the two distinct modalities, ensuring effective training. Experimental results show that PAVITS outperforms state-of-the-art EVC methods. Speech samples are available at https://jeremychee4.github.io/pavits4EVC/ .
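As an illustration of how a prosody alignment loss might tie the two latent prosody representations together, the following is a minimal sketch and not the formulation used in PAVITS. It assumes that the emotion descriptor and the prosody predictor each output a diagonal Gaussian (mean and log-variance) over the same latent space, and that alignment is measured with a KL divergence; the abstract does not specify the form of the loss, and the function name `prosody_alignment_loss` and its arguments are hypothetical.

```python
import math
from typing import Sequence


def prosody_alignment_loss(
    mu_desc: Sequence[float],
    logvar_desc: Sequence[float],
    mu_pred: Sequence[float],
    logvar_pred: Sequence[float],
) -> float:
    """KL(q || p) between two diagonal Gaussians.

    Assumption: q is the latent prosody distribution produced by the
    emotion descriptor, and p is the one produced by the prosody
    predictor. Both are hypothetical stand-ins for the real modules.
    """
    if not (len(mu_desc) == len(logvar_desc) == len(mu_pred) == len(logvar_pred)):
        raise ValueError("all inputs must have the same dimensionality")

    kl = 0.0
    for mq, lq, mp, lp in zip(mu_desc, logvar_desc, mu_pred, logvar_pred):
        # Per-dimension closed-form KL for univariate Gaussians.
        kl += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return kl


# Identical distributions give zero loss; shifting one mean increases it.
same = prosody_alignment_loss([0.0, 1.0], [0.0, 0.0], [0.0, 1.0], [0.0, 0.0])
shifted = prosody_alignment_loss([1.0], [0.0], [0.0], [0.0])
```

Under these assumptions, minimizing the loss pulls the predictor's distribution toward the descriptor's, so that at run time the predictor can stand in for a representation that was learned from the other modality during training.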