We propose a new model architecture designed for text-to-speech (TTS) models. It combines WavLM, a pre-trained self-supervised learning (SSL) speech model, with the BEST-RQ vector quantization framework. We assess whether pairing the more task-agnostic WavLM with the simple BEST-RQ framework, which is better suited to a wide range of downstream tasks, yields favorable outcomes. Experiments on the LibriSpeech dataset with SUPERB benchmarking show that the proposed model significantly underperforms. We speculate that the underlying reason is related to the difference between featurizing raw audio waveforms and featurizing spectrograms with a quantizer. We discuss the limitations of this approach to better guide future advancements in TTS.
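For readers unfamiliar with BEST-RQ, its quantizer derives discrete training targets by passing each input frame through a frozen random projection and looking up the nearest entry in a frozen random codebook, with both vectors L2-normalized. The following is a minimal sketch of that idea only, not the architecture evaluated here: the class name `RandomProjectionQuantizer`, the Gaussian initialization, and the toy dimensions are all assumptions made for illustration, and the frames are random numbers standing in for real features.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


class RandomProjectionQuantizer:
    """Hypothetical helper: frozen random projection plus frozen random codebook."""

    def __init__(self, input_dim, code_dim, codebook_size, seed=0):
        rng = random.Random(seed)
        # Projection matrix of shape (code_dim, input_dim). Gaussian init is an
        # assumption of this sketch; the weights are never trained.
        self.projection = [
            [rng.gauss(0.0, 1.0) for _ in range(input_dim)] for _ in range(code_dim)
        ]
        # Codebook entries are normalized once, so nearest-neighbour search
        # reduces to picking the largest dot product.
        self.codebook = [
            l2_normalize([rng.gauss(0.0, 1.0) for _ in range(code_dim)])
            for _ in range(codebook_size)
        ]

    def quantize(self, frame):
        """Map one feature frame to the index of its nearest codebook entry."""
        projected = l2_normalize(
            [sum(w * x for w, x in zip(row, frame)) for row in self.projection]
        )
        scores = [sum(c * p for c, p in zip(code, projected)) for code in self.codebook]
        return max(range(len(scores)), key=scores.__getitem__)


# Toy usage: 5 frames of 8-dimensional random features stand in for real inputs.
rng = random.Random(1)
frames = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(5)]
quantizer = RandomProjectionQuantizer(input_dim=8, code_dim=4, codebook_size=16)
labels = [quantizer.quantize(f) for f in frames]
print(labels)
```

Because the projected vector is normalized before the lookup, the assigned label depends on the direction of a frame rather than its magnitude. That is one reason the choice of input representation (raw waveform frames versus spectrogram frames) can change which targets the quantizer produces.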