USAT: A Universal Speaker-Adaptive Text-to-Speech Approach

Conventional text-to-speech (TTS) research has predominantly focused on enhancing the quality of synthesized speech for speakers in the training dataset. Synthesizing lifelike speech for unseen, out-of-dataset speakers, especially those with limited reference data, remains a significant unresolved problem. Zero-shot and few-shot speaker-adaptive TTS approaches have been explored, but both have limitations. Zero-shot approaches tend to generalize poorly and struggle to reproduce the voices of speakers with heavy accents. Few-shot methods can reproduce highly varying accents, but they bring a significant storage burden and a risk of overfitting and catastrophic forgetting. In addition, prior approaches provide either zero-shot or few-shot adaptation but not both, which constrains their utility across real-world scenarios with different demands. Moreover, most current evaluations of speaker-adaptive TTS are conducted only on datasets of native speakers, inadvertently neglecting the large population of non-native speakers with diverse accents. Our proposed framework unifies zero-shot and few-shot speaker adaptation strategies, which we term "instant" and "fine-grained" adaptation, respectively, based on their merits. To improve generalization in zero-shot speaker adaptation, we designed two discriminators and introduced a memory mechanism for the speech decoder. To prevent catastrophic forgetting and reduce storage requirements in few-shot speaker adaptation, we designed two adapters and a dedicated adaptation procedure.

Related content

Speech synthesis

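Speech synthesis, also known as text-to-speech (TTS), converts arbitrary input text into natural, fluent speech output. It draws on artificial intelligence, psychology, acoustics, linguistics, digital signal processing, computer science, and other disciplines, and is a frontier technology in the field of information processing. As computing technology has advanced, speech synthesis has evolved from early formant synthesis to waveform-concatenation synthesis and statistical parametric synthesis, and then to hybrid synthesis; the quality and naturalness of synthesized speech have improved markedly and can largely meet the needs of certain specific applications. Today, speech synthesis is widely used in information broadcast systems at banks and hospitals, in car navigation systems, and in automated-response call centers, and has produced substantial economic benefits. In addition, with the proliferation of smartphones, MP3 players, PDAs, and other devices closely tied to daily life, applications of speech synthesis are extending into entertainment, language teaching, and rehabilitation therapy. Speech synthesis can fairly be said to be affecting every aspect of people's lives.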
