This paper presents a parameter-efficient learning (PEL) approach to low-resource accent adaptation for text-to-speech (TTS). A resource-efficient adaptation of a frozen pre-trained TTS model is developed that uses only 0.8\% to 1.2\% of the original trainable parameters while achieving competitive performance in voice synthesis. Motivated by the theoretical foundation of optimal transport (OT), this study carries out PEL for TTS by introducing, in addition to the supervised training loss, an auxiliary unsupervised loss based on OT to maximize a difference between the pre-trained source domain and the (unseen) target domain. Further, we leverage this unsupervised loss refinement to boost system performance via either the sliced Wasserstein distance or the maximum mean discrepancy. The merit of this work is demonstrated by implementing PEL solutions based on residual adapter learning and model reprogramming, evaluated on Mandarin accent adaptation. Experimental results show that the proposed methods can achieve competitive naturalness with parameter-efficient decoder fine-tuning, and that the auxiliary unsupervised loss empirically improves model performance.
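To make the two discrepancy measures named in the abstract concrete, the following is a minimal NumPy sketch of a sliced Wasserstein distance and a Gaussian-kernel maximum mean discrepancy between two batches of feature vectors. It is an illustration under stated assumptions, not the authors' implementation: the function names, the number of projections, the kernel bandwidth, and the idea of comparing source-domain and target-domain feature batches of equal size are all hypothetical choices, and the abstract does not specify how the auxiliary term is weighted against the supervised loss.

```python
import numpy as np


def sliced_wasserstein(x, y, n_projections=64, seed=0):
    """Monte Carlo estimate of the sliced 2-Wasserstein distance.

    Assumes x and y are arrays of shape (n, d) with the same n, so that
    sorting each 1-D projection gives the optimal 1-D coupling.
    """
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    # Draw random directions and normalise them onto the unit sphere.
    theta = rng.normal(size=(n_projections, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    # Project both point sets onto every direction, then sort per direction.
    px = np.sort(x @ theta.T, axis=0)
    py = np.sort(y @ theta.T, axis=0)
    # Average the squared 1-D transport cost over samples and directions.
    return float(np.sqrt(np.mean((px - py) ** 2)))


def mmd_rbf(x, y, bandwidth=1.0):
    """Biased estimate of the squared MMD with a Gaussian (RBF) kernel."""

    def kernel(a, b):
        # Pairwise squared Euclidean distances between rows of a and b.
        sq = (
            np.sum(a**2, axis=1)[:, None]
            + np.sum(b**2, axis=1)[None, :]
            - 2.0 * a @ b.T
        )
        return np.exp(-sq / (2.0 * bandwidth**2))

    return float(
        kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()
    )


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    source = rng.normal(size=(128, 8))       # hypothetical source-domain features
    target = source + 2.0                    # hypothetical shifted target features
    print("SWD:", sliced_wasserstein(source, target))
    print("MMD^2:", mmd_rbf(source, target))
```

Both functions return zero for identical batches and a positive value for the shifted batch. In a training setup, a differentiable version of either quantity (for example in an autodiff framework) would be computed on intermediate features and combined with the supervised loss; that combination is left out here because the abstract does not describe it.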