Speaker adaptation in text-to-speech (TTS) remains difficult for languages that are not widely spoken and for speakers whose accents or dialects are poorly represented in the training data. To address this, we propose a "mixture of adapters" method, which adds multiple adapters within a backbone-model layer to learn the distinct characteristics of different speakers. Our approach outperforms the baseline, with a 5% improvement in speaker preference tests when using only one minute of data for each new speaker. Moreover, following the adapter paradigm, we fine-tune only the adapter parameters, which account for 11% of the total model parameters. This is a notable step toward parameter-efficient speaker adaptation, and the model is one of the first of its kind. Overall, the proposed approach offers a promising solution for speech synthesis, particularly for adapting to speakers from diverse backgrounds.
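The idea of placing several adapters inside one backbone layer can be illustrated with a minimal sketch. The abstract does not specify the adapter architecture, how the adapters are combined, or where they sit in the model, so everything below is an assumption made for illustration: the bottleneck (down-project, ReLU, up-project) adapter shape, the softmax gate that mixes adapter outputs, the zero-initialised up-projection, the residual connection, and all class and parameter names are hypothetical.

```python
import math
import random


def matvec(weights, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class BottleneckAdapter:
    """Down-project, ReLU, up-project (an assumed, commonly used adapter shape)."""

    def __init__(self, hidden, bottleneck, rng):
        self.down = [[rng.gauss(0.0, 0.1) for _ in range(hidden)]
                     for _ in range(bottleneck)]
        # Zero-initialised up-projection: an untrained adapter outputs zeros.
        self.up = [[0.0] * bottleneck for _ in range(hidden)]

    def __call__(self, x):
        squeezed = [max(0.0, v) for v in matvec(self.down, x)]
        return matvec(self.up, squeezed)

    def num_params(self):
        return sum(len(r) for r in self.down) + sum(len(r) for r in self.up)


class MixtureOfAdapters:
    """Several adapters in one layer, mixed by a softmax gate (assumed routing)."""

    def __init__(self, hidden, bottleneck, num_adapters, seed=0):
        rng = random.Random(seed)
        self.adapters = [BottleneckAdapter(hidden, bottleneck, rng)
                         for _ in range(num_adapters)]
        self.gate = [[rng.gauss(0.0, 0.1) for _ in range(hidden)]
                     for _ in range(num_adapters)]

    def gate_weights(self, x):
        """One mixing weight per adapter, computed from the layer input."""
        return softmax(matvec(self.gate, x))

    def __call__(self, x):
        weights = self.gate_weights(x)
        outputs = [adapter(x) for adapter in self.adapters]
        mixed = [sum(w * out[i] for w, out in zip(weights, outputs))
                 for i in range(len(x))]
        # Residual connection: the backbone activation passes through unchanged.
        return [xi + mi for xi, mi in zip(x, mixed)]

    def num_params(self):
        """Parameters that would be fine-tuned; the backbone is not counted."""
        return (sum(a.num_params() for a in self.adapters)
                + sum(len(r) for r in self.gate))


layer = MixtureOfAdapters(hidden=8, bottleneck=2, num_adapters=4)
hidden_state = [0.5, -1.0, 0.25, 0.0, 1.5, -0.5, 2.0, 1.0]
adapted = layer(hidden_state)
```

In this sketch the layer initially returns its input unchanged, because every up-projection starts at zero. During adaptation only the adapter and gate weights would be updated while the backbone stays frozen, which is what keeps the trainable share of the model small.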