This paper proposes QS-TTS, a novel semi-supervised TTS framework that improves synthesis quality while requiring less supervised data. It does so through Vector-Quantized Self-Supervised Speech Representation Learning (VQ-S3RL), which exploits additional unlabeled speech audio. The framework comprises two VQ-S3R learners. First, the principal learner uses an MSMC-VQ-GAN combined with contrastive S3RL to produce a generative Multi-Stage Multi-Codebook (MSMC) VQ-S3R and to decode it back into high-quality audio. Then, the associate learner further abstracts the MSMC representation into a highly compact VQ representation through a VQ-VAE. These two generative VQ-S3R learners provide effective speech representations and pre-trained models for TTS, significantly improving synthesis quality while lowering the requirement for supervised data. QS-TTS is evaluated comprehensively under various scenarios using subjective and objective tests. The results demonstrate the superior performance of QS-TTS, which achieves the highest MOS among the supervised and semi-supervised baseline TTS approaches, especially in low-resource scenarios. Moreover, a comparison of various speech representations and transfer learning methods in TTS further validates the improvement that the proposed VQ-S3RL brings to TTS, with the best audio quality and intelligibility metrics. The synthesis quality of QS-TTS also decays more slowly as the amount of supervised data decreases, which further highlights its lower requirement for supervised data and indicates its potential in low-resource scenarios.
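To make the idea of a multi-codebook VQ representation more concrete, the following is a minimal sketch of nearest-neighbour vector quantization with several codebooks. It assumes that each feature vector is split into equal chunks and that each chunk is matched against its own codebook; the abstract does not specify how QS-TTS arranges its codebooks, so this split, the toy codebooks, and the function names `quantize` and `multi_codebook_quantize` are all hypothetical and serve only as an illustration.

```python
# Illustrative sketch only: a nearest-neighbour, multi-codebook vector quantizer.
# The codebooks, dimensions, and function names are hypothetical and are not
# taken from QS-TTS.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vector, codebook):
    """Return (index, codeword) of the codebook entry closest to `vector`."""
    index = min(
        range(len(codebook)),
        key=lambda i: squared_distance(vector, codebook[i]),
    )
    return index, codebook[index]


def multi_codebook_quantize(vector, codebooks):
    """Split `vector` into equal chunks and quantize each with its own codebook.

    Assumes len(vector) is divisible by len(codebooks).
    """
    chunk = len(vector) // len(codebooks)
    indices, quantized = [], []
    for k, codebook in enumerate(codebooks):
        part = vector[k * chunk:(k + 1) * chunk]
        idx, codeword = quantize(part, codebook)
        indices.append(idx)
        quantized.extend(codeword)
    return indices, quantized


# Two toy codebooks, each holding two 2-dimensional codewords.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
frame = [0.9, 0.8, 0.1, 0.7]  # one hypothetical 4-dimensional feature frame

indices, quantized = multi_codebook_quantize(frame, codebooks)
print(indices)    # [1, 0]
print(quantized)  # [1.0, 1.0, 0.0, 1.0]
```

In this sketch the frame is represented by two small integers instead of four floats, which is the sense in which a VQ representation is compact. A learned system would train the codebooks jointly with an encoder and decoder rather than fix them by hand.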