Expressive voice conversion (VC) performs speaker identity conversion for emotional speakers by jointly converting speaker identity and emotional style. Emotional style modeling for arbitrary speakers in expressive VC has not been extensively explored. Previous approaches have relied on vocoders for speech reconstruction, making speech quality heavily dependent on vocoder performance. A further challenge of expressive VC lies in modeling emotional prosody. To address these challenges, this paper proposes a fully end-to-end expressive VC framework based on a conditional denoising diffusion probabilistic model (DDPM). We use speech units derived from self-supervised speech models as content conditioning, together with deep features extracted from speech emotion recognition and speaker verification systems to model emotional style and speaker identity. Objective and subjective evaluations demonstrate the effectiveness of our framework. Code and samples are publicly available.
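To make the conditioning scheme concrete, the sketch below shows a toy conditional DDPM reverse sampler in the style the abstract describes: a denoiser that receives content units, an emotion embedding, and a speaker embedding as conditioning. This is a minimal illustration, not the paper's implementation; all shapes, the noise schedule, and the placeholder `denoiser` are assumptions for demonstration.

```python
import numpy as np

# Toy linear noise schedule (illustrative values, not the paper's).
T = 50
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def q_sample(x0, t, noise):
    """Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise

def denoiser(x_t, t, units, emo_emb, spk_emb):
    """Placeholder epsilon-predictor. A real model would be a neural network
    conditioned on the concatenated content/emotion/speaker features."""
    cond = np.concatenate([units, emo_emb, spk_emb])
    return np.zeros_like(x_t) + 0.0 * cond.mean()  # dummy zero prediction

def p_sample_step(x_t, t, units, emo_emb, spk_emb, rng):
    """One reverse (denoising) step of the standard DDPM sampler."""
    eps = denoiser(x_t, t, units, emo_emb, spk_emb)
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
    mean = (x_t - coef * eps) / np.sqrt(alphas[t])
    if t > 0:  # no noise is added at the final step
        mean = mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)
    return mean

rng = np.random.default_rng(0)
x0 = rng.standard_normal(80)       # toy "mel frame" (80 bins assumed)
units = rng.standard_normal(256)   # content units from a self-supervised model
emo = rng.standard_normal(128)     # emotion embedding from an SER system
spk = rng.standard_normal(192)     # speaker embedding from an SV system

# Noise the target, then denoise step by step under the conditioning.
x_t = q_sample(x0, T - 1, rng.standard_normal(80))
for t in reversed(range(T)):
    x_t = p_sample_step(x_t, t, units, emo, spk, rng)
print(x_t.shape)  # (80,)
```

In a full system, the generated acoustic frames would be produced end to end rather than passed through a separate vocoder, which is the dependency the abstract argues against.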