Recent work has explored using self-supervised learning (SSL) speech representations such as wav2vec2.0 as the representation medium in standard two-stage TTS, in place of the conventionally used mel-spectrograms. It is, however, unclear which speech SSL is the better fit for TTS, and whether performance differs between read and spontaneous TTS, the latter of which is arguably more challenging. This study aims to address these questions by testing several speech SSLs, including different layers of the same SSL, in two-stage TTS on both read and spontaneous corpora, while keeping the TTS model architecture and training settings constant. Results from listening tests show that the 9th layer of the 12-layer wav2vec2.0 (ASR finetuned) outperforms the other tested SSLs and mel-spectrograms in both read and spontaneous TTS. Our work sheds light both on how speech SSL can readily improve current TTS systems and on how SSLs compare in the challenging generative task of TTS. Audio examples can be found at https://www.speech.kth.se/tts-demos/ssr_tts