Zero-shot emotion transfer in cross-lingual speech synthesis aims to transfer the emotion of an arbitrary speech reference in the source language to synthetic speech in the target language. Building such a system faces two challenges: unnatural foreign accents and the difficulty of modeling the emotional expressions shared across languages. Building on the DelightfulTTS neural architecture, this paper addresses these challenges by introducing specifically designed modules that model language-specific prosody features and language-shared emotional expressions separately. Specifically, the language-specific speech prosody is learned by a non-autoregressive predictive coding (NPC) module to improve the naturalness of the synthetic cross-lingual speech. The emotional expression shared between languages is extracted from HuBERT, a pre-trained self-supervised model with strong generalization capabilities. We further use hierarchical emotion modeling to capture more comprehensive emotions across languages. Experimental results demonstrate the proposed framework's effectiveness in synthesizing bilingual emotional speech for a monolingual target speaker for whom no emotional training data is available.