In this paper, we experimented with the SpeechT5 model pre-trained on large-scale datasets. We pre-trained the foundation model from scratch and fine-tuned it on a large-scale, robust multi-speaker text-to-speech (TTS) task. We tested the model's capabilities in zero- and few-shot scenarios. In two listening tests, we evaluated the quality of the synthetic audio and how closely the synthetic voices resemble the real ones. Our results showed that the SpeechT5 model can generate a synthetic voice for any speaker using only one minute of the target speaker's data. We demonstrated the high quality and similarity of our synthetic voices on well-known Czech politicians and celebrities.