In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random variable through diffusion models, which lets it generate the most suitable style for the text without requiring reference speech. This achieves efficient latent diffusion while benefiting from the diverse speech synthesis offered by diffusion models. Furthermore, we employ large pre-trained SLMs, such as WavLM, as discriminators with our novel differentiable duration modeling for end-to-end training, resulting in improved speech naturalness. StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches them on the multispeaker VCTK dataset, as judged by native English speakers. Moreover, when trained on the LibriTTS dataset, our model outperforms previous publicly available models for zero-shot speaker adaptation. This work achieves the first human-level TTS on both single-speaker and multispeaker datasets, showcasing the potential of style diffusion and adversarial training with large SLMs. The audio demos and source code are available at https://styletts2.github.io/.