Although cross-lingual TTS based on monolingual corpora has improved significantly in recent years, cross-lingual speech generation still suffers from foreign accent, which limits naturalness. In addition, current cross-lingual methods do not model emotion, an indispensable piece of paralinguistic information in speech delivery. In this paper, we propose DiCLET-TTS, a Diffusion-model-based Cross-Lingual Emotion Transfer method that transfers emotion from a source speaker to intra- and cross-lingual target speakers. Specifically, to relieve the foreign accent problem while improving emotional expressiveness, the terminal distribution of the forward diffusion process is parameterized as a speaker-irrelevant but emotion-related linguistic prior by a prior text encoder conditioned on the emotion embedding. To address the weakened emotional expressiveness caused by speaker disentanglement in the emotion embedding, a novel orthogonal-projection-based emotion disentangling module (OP-EDM) is proposed to learn a speaker-irrelevant but emotion-discriminative embedding. Moreover, a condition-enhanced diffusion probabilistic model (DPM) decoder is introduced to strengthen the modeling of speaker and emotion in the reverse diffusion process, further improving emotional expressiveness in speech delivery. Cross-lingual emotion transfer experiments show that DiCLET-TTS outperforms various competitive models and that OP-EDM is effective at learning a speaker-irrelevant but emotion-discriminative embedding.
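The abstract does not specify how OP-EDM is built, so the following is only a minimal sketch of the generic idea behind an orthogonal projection: subtracting from one vector its component along another, so that the remainder is orthogonal to it. The function name `remove_speaker_component`, the `eps` parameter, and the toy vectors are hypothetical illustrations, not the paper's module, which learns its embeddings during training.

```python
def remove_speaker_component(emotion_emb, speaker_emb, eps=1e-8):
    """Return emotion_emb minus its projection onto speaker_emb.

    Hypothetical helper for illustration only; it is NOT the OP-EDM
    described in the abstract, just a plain vector projection.
    """
    # Inner products <e, s> and <s, s>.
    dot_es = sum(e * s for e, s in zip(emotion_emb, speaker_emb))
    dot_ss = sum(s * s for s in speaker_emb)
    # Projection coefficient; eps guards against a zero speaker vector.
    coeff = dot_es / (dot_ss + eps)
    # Residual lies (up to eps) in the orthogonal complement of speaker_emb.
    return [e - coeff * s for e, s in zip(emotion_emb, speaker_emb)]


# Toy 2-D vectors (assumed values, chosen only to make the geometry visible).
emotion = [3.0, 4.0]
speaker = [1.0, 0.0]
residual = remove_speaker_component(emotion, speaker)
# The residual keeps only the part of `emotion` that is orthogonal to `speaker`.
overlap = sum(r * s for r, s in zip(residual, speaker))
```

In this toy case the component of `emotion` along `speaker` is removed while the remaining coordinate is left unchanged, which is the sense in which a projected embedding can be "speaker-irrelevant" yet still carry other information.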