Large language models (LLMs) have made significant advances in natural language processing and are concurrently extending their language ability to other modalities, such as speech and vision. Nevertheless, most previous work focuses on prompting LLMs with perception abilities such as auditory comprehension, and an effective approach for augmenting LLMs with speech synthesis capabilities remains unclear. In this paper, we conduct a comprehensive empirical exploration of boosting LLMs with the ability to generate speech by combining the pre-trained LLMs LLaMA/OPT with the text-to-speech synthesis model VALL-E. We compare three methods of integrating LLMs with speech synthesis models: directly fine-tuning LLMs, superposing the layers of LLMs and VALL-E, and coupling LLMs with VALL-E by using LLMs as a powerful text encoder. Experimental results show that directly fine-tuning LLMs with the LoRA method to boost speech synthesis capability does not work well, whereas superposing LLMs and VALL-E improves the quality of generated speech in both speaker similarity and word error rate (WER). Among the three methods, the coupled method, which leverages LLMs as the text encoder, achieves the best performance: it outperforms the original speech synthesis models with consistently better speaker similarity and a significant (10.9%) WER reduction.
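For readers unfamiliar with the WER metric cited above, the following is a minimal sketch of how word error rate is conventionally computed: the word-level edit distance between a reference transcript and a recognized transcript, divided by the reference length. It is not the paper's evaluation pipeline. The function names `word_error_rate` and `relative_reduction` and the example sentences are hypothetical, and the abstract does not state whether the 10.9% figure is a relative or an absolute reduction; the second helper shows only the relative reading.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (reference word missing)
                curr[j - 1] + 1,                  # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative reduction of a metric, e.g. WER, with respect to a baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Hypothetical transcripts, not data from the paper.
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER = {wer:.3f}")
```

In speech synthesis evaluation, the hypothesis is typically obtained by running a speech recognizer on the generated audio, so a lower WER indicates more intelligible speech.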