Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like

Naoyuki Kanda, Xiaofei Wang, Sefik Emre Eskimez, Manthan Thakker, Hemin Yang, Zirun Zhu, Min Tang, Canrun Li, Steven Tsai, Zhen Xiao, Yufei Xia, Jinzhu Li, Yanqing Liu, Sheng Zhao, Michael Zeng

From arXiv. See https://aka.ms/elate/ for demo samples.

Laughter is one of the most expressive and natural aspects of human speech, conveying emotions, social cues, and humor. However, most text-to-speech (TTS) systems lack the ability to produce realistic and appropriate laughter sounds, limiting their applications and user experience. While there have been prior works to generate natural laughter, they fell short in terms of controlling the timing and variety of the laughter to be generated. In this work, we propose ELaTE, a zero-shot TTS that can generate natural laughing speech of any speaker based on a short audio prompt with precise control of laughter timing and expression. Specifically, ELaTE works on the audio prompt to mimic the voice characteristic, the text prompt to indicate the contents of the generated speech, and the input to control the laughter expression, which can be either the start and end times of laughter, or the additional audio prompt that contains laughter to be mimicked. We develop our model based on the foundation of conditional flow-matching-based zero-shot TTS, and fine-tune it with frame-level representation from a laughter detector as additional conditioning. With a simple scheme to mix small-scale laughter-conditioned data with large-scale pre-training data, we demonstrate that a pre-trained zero-shot TTS model can be readily fine-tuned to generate natural laughter with precise controllability, without losing any quality of the pre-trained zero-shot TTS model. Through the evaluations, we show that ELaTE can generate laughing speech with significantly higher quality and controllability compared to conventional models. See https://aka.ms/elate/ for demo samples.
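The two control ideas in the abstract, frame-level laughter conditioning and conditional flow matching, can be illustrated with a small NumPy sketch. This is not the authors' implementation. The function names, the 100 Hz frame rate, the 80-dimensional features, and the use of a binary start/end mask in place of the laughter detector's frame-level representation are all assumptions made for illustration. The interpolation path is the commonly used optimal-transport form of conditional flow matching.

```python
import numpy as np


def laughter_frame_mask(num_frames, start_s, end_s, frame_rate_hz=100.0):
    """Hypothetical helper: per-frame indicator that is 1.0 inside [start_s, end_s).

    A binary mask is a simplified stand-in for the frame-level representation
    that a laughter detector would provide (assumption, not from the paper).
    """
    mask = np.zeros(num_frames, dtype=np.float32)
    start = max(0, int(round(start_s * frame_rate_hz)))
    end = min(num_frames, int(round(end_s * frame_rate_hz)))
    if end > start:
        mask[start:end] = 1.0
    return mask


def cfm_training_pair(x1, t, sigma_min=1e-5, rng=None):
    """Sample a noisy point x_t and its target velocity for flow matching.

    Uses the optimal-transport path
        x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1,
    whose target velocity is u = x1 - (1 - sigma_min) * x0.
    """
    rng = np.random.default_rng() if rng is None else rng
    x0 = rng.standard_normal(x1.shape).astype(x1.dtype)  # Gaussian noise sample
    xt = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    target = x1 - (1.0 - sigma_min) * x0
    return xt, target


# Illustrative shapes: 3 seconds of 80-dim features at 100 frames per second.
features = np.ones((300, 80), dtype=np.float32)
mask = laughter_frame_mask(300, start_s=1.0, end_s=2.0)
xt, target = cfm_training_pair(features, t=0.3, rng=np.random.default_rng(0))

# The conditioning is appended as one extra channel per frame, so a model
# would see both the noisy features and where laughter is requested.
model_input = np.concatenate([xt, mask[:, None]], axis=1)
```

A model trained on such pairs would regress `target` from `model_input`, the time `t`, and the text and audio prompts, which are omitted here for brevity.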


Related content

Speech Synthesis

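Speech synthesis, also known as text-to-speech (TTS), converts arbitrary input text into natural, fluent speech output. It draws on artificial intelligence, psychology, acoustics, linguistics, digital signal processing, computer science, and other disciplines, and is a frontier technology in the field of information processing. As computing technology has advanced, speech synthesis has evolved from early formant synthesis to waveform concatenation synthesis and statistical parametric speech synthesis, and then to hybrid speech synthesis; the quality and naturalness of synthesized speech have improved markedly and can largely meet the needs of certain specific applications. Today, speech synthesis is widely used in information announcement systems in banks and hospitals, in car navigation systems, and in automated-answering call centers, and has generated substantial economic benefits. In addition, with the proliferation of smartphones, MP3 players, PDAs, and other media closely tied to daily life, applications of speech synthesis are also extending into entertainment, language teaching, rehabilitation therapy, and other fields. Speech synthesis can be said to be affecting every aspect of people's lives.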
