Laughter is one of the most expressive and natural aspects of human speech, conveying emotions, social cues, and humor. However, most text-to-speech (TTS) systems cannot produce realistic and appropriate laughter, which limits their applications and user experience. Prior work on generating natural laughter has fallen short in controlling the timing and variety of the generated laughter. In this work, we propose ELaTE, a zero-shot TTS system that can generate natural laughing speech for any speaker from a short audio prompt, with precise control of laughter timing and expression. Specifically, ELaTE takes three inputs: an audio prompt whose voice characteristics it mimics, a text prompt that specifies the content of the generated speech, and an input that controls the laughter expression, which can be either the start and end times of the laughter or an additional audio prompt containing laughter to be mimicked. We build our model on a conditional flow-matching-based zero-shot TTS system and fine-tune it with frame-level representations from a laughter detector as additional conditioning. Using a simple scheme that mixes small-scale laughter-conditioned data with large-scale pre-training data, we demonstrate that a pre-trained zero-shot TTS model can be readily fine-tuned to generate natural laughter with precise controllability, without losing any quality of the pre-trained model. Our evaluations show that ELaTE generates laughing speech with significantly higher quality and controllability than conventional models. See https://aka.ms/elate/ for demo samples.
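As an illustration of the timing-control input described in the abstract, the sketch below turns laughter start and end times into a per-frame 0/1 indicator. This is only a simplified stand-in: the abstract states that the model is conditioned on frame-level representations from a laughter detector, and does not say how start and end times are converted into that conditioning. The function name `laughter_frame_mask`, the binary encoding, and the default frame rate of 100 frames per second are all assumptions made for this example.

```python
from typing import List, Tuple


def laughter_frame_mask(
    intervals: List[Tuple[float, float]],
    total_seconds: float,
    frames_per_second: int = 100,  # assumed frame rate, not taken from the paper
) -> List[int]:
    """Return one flag per frame: 1 inside a laughter interval, 0 elsewhere.

    Hypothetical helper: a binary mask is used here as a simple proxy for
    the frame-level laughter representation mentioned in the abstract.
    """
    num_frames = round(total_seconds * frames_per_second)
    mask = [0] * num_frames
    for start, end in intervals:
        if end <= start:
            raise ValueError(f"invalid interval: ({start}, {end})")
        # Clamp the interval to the utterance so out-of-range times are ignored.
        first = max(0, round(start * frames_per_second))
        last = min(num_frames, round(end * frames_per_second))
        for i in range(first, last):
            mask[i] = 1
    return mask


if __name__ == "__main__":
    # Laughter from 0.5 s to 1.0 s in a 2-second utterance, at 10 frames/s.
    print(laughter_frame_mask([(0.5, 1.0)], total_seconds=2.0, frames_per_second=10))
```

In a setup like the one the abstract describes, a frame-aligned signal of this kind would sit alongside the audio prompt and text prompt as an extra conditioning input; the actual representation used by ELaTE comes from a laughter detector rather than a hand-built mask.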