Recent work has explored using self-supervised learning (SSL) speech representations such as wav2vec2.0 as the representation medium in standard two-stage TTS, in place of the conventionally used mel-spectrograms. It is, however, unclear which speech SSL is the better fit for TTS, and whether performance differs between read and spontaneous TTS, the latter of which is arguably more challenging. This study addresses these questions by testing several speech SSLs, including different layers of the same SSL, in two-stage TTS on both read and spontaneous corpora, while keeping the TTS model architecture and training settings constant. Results from listening tests show that the 9th layer of the 12-layer wav2vec2.0 (ASR finetuned) outperforms the other tested SSLs and mel-spectrograms in both read and spontaneous TTS. Our work sheds light both on how speech SSL can readily improve current TTS systems and on how SSLs compare in the challenging generative task of TTS. Audio examples can be found at https://www.speech.kth.se/tts-demos/ssr_tts