Recent advances in Natural Language Processing (NLP) have seen Large-scale Language Models (LLMs) excel at producing high-quality text for a variety of purposes. In Text-To-Speech (TTS) systems, the use of BERT for semantic token generation has underscored the importance of semantic content in producing coherent speech. Despite this, the use of LLMs to enhance TTS synthesis remains limited. This work introduces Llama-VITS, an approach that enhances TTS synthesis by enriching the semantic content of the input text with an LLM. Llama-VITS integrates semantic embeddings from Llama2 into the VITS model, a leading end-to-end TTS framework. With Llama2 incorporated into the primary speech synthesis process, our experiments show that Llama-VITS matches the naturalness of both the original VITS (ORI-VITS) and a variant that incorporates BERT (BERT-VITS) on the LJSpeech dataset, a substantial collection of neutral, clear speech. Moreover, our method significantly enhances emotive expressiveness on the EmoV_DB_bea_sem dataset, a curated selection of emotionally consistent speech from the EmoV_DB dataset, highlighting its potential to generate emotive speech.