This paper describes our zero-shot spontaneous-style text-to-speech (TTS) system for the ISCSLP 2024 Conversational Voice Clone Challenge (CoVoC). We propose a LLaMA-based codec language model with a delay pattern to achieve spontaneous-style voice cloning. To improve speech intelligibility, we introduce a Classifier-Free Guidance (CFG) strategy into the language model to strengthen conditional guidance on token prediction. To generate high-quality utterances, we apply effective data preprocessing and fine-tune our model on selected high-quality spontaneous speech data. The official evaluation on the CoVoC constrained track shows that our system achieves the best speech naturalness MOS of 3.80 and obtains considerable results in speech quality and speaker similarity.
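The two mechanisms named above, the delay pattern over codec tokens and Classifier-Free Guidance on token prediction, can be illustrated with a minimal sketch. It is not the system's actual implementation: it assumes the commonly used formulation in which codebook k is shifted right by k steps, and in which guided logits are computed as uncond + scale * (cond - uncond). The names `apply_delay_pattern`, `undo_delay_pattern`, `cfg_logits`, and the padding id `PAD` are hypothetical and chosen only for illustration.

```python
# Minimal sketch under stated assumptions; all names here are hypothetical.
PAD = -1  # assumed padding id for positions that hold no codec token


def apply_delay_pattern(codes, pad=PAD):
    """Shift codebook k right by k steps.

    codes: list of K lists, each holding T token ids (one list per codebook).
    Returns K lists of length T + K - 1, padded where no token is present.
    """
    num_codebooks = len(codes)
    num_frames = len(codes[0])
    total_steps = num_frames + num_codebooks - 1
    delayed = []
    for k, row in enumerate(codes):
        right_pad = total_steps - k - num_frames
        delayed.append([pad] * k + list(row) + [pad] * right_pad)
    return delayed


def undo_delay_pattern(delayed):
    """Invert apply_delay_pattern and recover the K x T token grid."""
    num_codebooks = len(delayed)
    num_frames = len(delayed[0]) - num_codebooks + 1
    return [row[k:k + num_frames] for k, row in enumerate(delayed)]


def cfg_logits(cond, uncond, scale):
    """Combine conditional and unconditional logits.

    scale = 1.0 returns the conditional logits unchanged; larger values
    push the prediction further toward the conditioning signal.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    codes = [[1, 2, 3], [4, 5, 6]]  # 2 codebooks, 3 frames
    delayed = apply_delay_pattern(codes)
    print(delayed)
    print(undo_delay_pattern(delayed))
    print(cfg_logits([2.0, 0.0], [1.0, 1.0], 2.0))
```

With this layout, a language model can predict one token per codebook at every step while each higher codebook still conditions on the lower ones from earlier steps. The guidance scale is a tunable hyperparameter; the value used by the system is not stated in this abstract.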