Advances in zero-shot text-to-speech (TTS) based on large-scale models have demonstrated high fidelity in reproducing speaker characteristics. However, these models are too large for practical daily use. We propose a lightweight zero-shot TTS method using a mixture of adapters (MoA). Our proposed method incorporates MoA modules into the decoder and the variance adapter of a non-autoregressive TTS model. These modules enhance the ability to adapt to a wide variety of speakers in a zero-shot manner by selecting appropriate adapters associated with speaker characteristics on the basis of speaker embeddings. Our method achieves high-quality speech synthesis with minimal additional parameters. Through objective and subjective evaluations, we confirmed that our method achieves better performance than the baseline with fewer than 40\% of the parameters at 1.9 times faster inference speed. Audio samples are available on our demo page (https://ntt-hilab-gensp.github.io/is2024lightweightTTS/).
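The core idea above (a gating network weights a set of small bottleneck adapters according to the speaker embedding) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: all dimensions, the number of adapters, the bottleneck structure, and the softmax gating are assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not the paper's configuration).
HIDDEN = 16      # decoder hidden size
SPK = 8          # speaker-embedding size
BOTTLENECK = 4   # adapter bottleneck size
N_ADAPTERS = 3   # number of adapters in the mixture

# Each adapter is a small bottleneck: down-project, ReLU, up-project.
down = rng.standard_normal((N_ADAPTERS, HIDDEN, BOTTLENECK)) * 0.1
up = rng.standard_normal((N_ADAPTERS, BOTTLENECK, HIDDEN)) * 0.1

# Gating network: maps the speaker embedding to weights over the adapters.
gate_w = rng.standard_normal((SPK, N_ADAPTERS)) * 0.1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moa_layer(hidden, spk_emb):
    """Apply a mixture of adapters to one hidden vector.

    The gate is computed from the speaker embedding only, so the same
    adapter mixture is reused for every frame of that speaker.
    """
    gate = softmax(spk_emb @ gate_w)                    # (N_ADAPTERS,)
    out = hidden.copy()                                 # residual connection
    for k in range(N_ADAPTERS):
        a = np.maximum(hidden @ down[k], 0.0) @ up[k]   # bottleneck adapter k
        out += gate[k] * a                              # speaker-weighted sum
    return out, gate

hidden = rng.standard_normal(HIDDEN)    # one decoder hidden vector
spk_emb = rng.standard_normal(SPK)      # one speaker embedding
out, gate = moa_layer(hidden, spk_emb)
```

Because only the small adapter and gate matrices are added on top of the frozen backbone, the extra parameter count stays minimal while the effective transformation still varies per speaker.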