Emotional Text-To-Speech (TTS) is an important task in the development of systems (e.g., human-like dialogue agents) that require natural and emotional speech. Existing approaches, however, only aim to produce emotional speech for speakers seen during training and do not consider generalization to unseen speakers. In this paper, we propose ZET-Speech, a zero-shot adaptive emotion-controllable TTS model that allows users to synthesize any speaker's emotional speech using only a short, neutral speech segment and the target emotion label. Specifically, to enable a zero-shot adaptive TTS model to synthesize emotional speech, we propose domain adversarial learning and guidance methods on the diffusion model. Experimental results demonstrate that ZET-Speech successfully synthesizes natural speech with the desired emotion for both seen and unseen speakers. Samples are at https://ZET-Speech.github.io/ZET-Speech-Demo/.