Conventional text-to-speech (TTS) research has predominantly focused on improving the quality of synthesized speech for speakers seen in the training dataset. Synthesizing lifelike speech for unseen, out-of-dataset speakers, especially those with limited reference data, remains a significant and unresolved problem. Zero-shot and few-shot speaker-adaptive TTS approaches have been explored, but each has notable limitations. Zero-shot approaches tend to generalize poorly and fail to reproduce the voices of speakers with heavy accents. Few-shot methods can reproduce highly varying accents, but they impose a significant storage burden and carry the risk of overfitting and catastrophic forgetting. In addition, prior approaches provide either zero-shot or few-shot adaptation but not both, which limits their utility across real-world scenarios with different demands. Moreover, most current evaluations of speaker-adaptive TTS are conducted only on datasets of native speakers, inadvertently neglecting the large population of non-native speakers with diverse accents. Our proposed framework unifies zero-shot and few-shot speaker adaptation, which we term "instant" and "fine-grained" adaptation, respectively, based on their merits. To alleviate the insufficient generalization observed in zero-shot speaker adaptation, we design two innovative discriminators and introduce a memory mechanism for the speech decoder. To prevent catastrophic forgetting and reduce the storage cost of few-shot speaker adaptation, we design two adapters and a unique adaptation procedure.