Benefiting from effective speech modeling, current Speech Large Language Models (SLLMs) have demonstrated exceptional capabilities in in-context speech generation and efficient generalization to unseen speakers. However, the prevailing information modeling process contains redundancies that make speech generation inefficient. We propose Chain-of-Information Generation (CoIG), a method for decoupling semantic and perceptual information in large-scale speech generation. Building on this, we develop SpeechGPT-Gen, an 8-billion-parameter SLLM that models semantic and perceptual information efficiently. It comprises an LLM-based autoregressive model for semantic information modeling and a non-autoregressive model that uses flow matching for perceptual information modeling. Additionally, we introduce a novel approach of infusing semantic information into the prior distribution to improve the efficiency of flow matching. Extensive experimental results demonstrate that SpeechGPT-Gen excels in zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue, underscoring CoIG's proficiency in capturing and modeling the semantic and perceptual dimensions of speech. Code and models are available at https://github.com/0nutation/SpeechGPT.
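To illustrate the idea of infusing semantic information into the prior distribution of flow matching, the following is a minimal sketch and not the SpeechGPT-Gen implementation. It assumes a straight-line probability path between a prior sample and the target, a Gaussian prior centred on the semantic features rather than on zero, and a fixed-step Euler solver at inference time. None of these details are specified in the abstract, and all function names (`sample_prior`, `interpolate`, `target_velocity`, `flow_matching_loss`, `euler_sample`) are hypothetical.

```python
import random


def sample_prior(semantic, sigma=1.0, rng=random):
    """Draw x0 ~ N(semantic, sigma^2 I).

    Assumption: the prior is a Gaussian centred on the semantic features,
    instead of the standard N(0, I) prior.
    """
    return [s + sigma * rng.gauss(0.0, 1.0) for s in semantic]


def interpolate(x0, x1, t):
    """Point on the straight-line path from x0 (t = 0) to x1 (t = 1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight-line path, which is constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(velocity_fn, x0, x1, t):
    """Mean squared error between predicted and target velocity at x_t."""
    x_t = interpolate(x0, x1, t)
    predicted = velocity_fn(x_t, t)
    target = target_velocity(x0, x1)
    return sum((p - u) ** 2 for p, u in zip(predicted, target)) / len(target)


def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    semantic = [0.5, -1.0, 2.0]   # toy stand-in for semantic features
    target = [0.7, -0.8, 2.5]     # toy stand-in for the full speech representation
    x0 = sample_prior(semantic, sigma=0.1, rng=rng)

    # An oracle velocity field for this single (x0, target) pair; a real
    # system would train a neural network to predict it.
    oracle = lambda x, t: target_velocity(x0, target)

    print(flow_matching_loss(oracle, x0, target, t=0.3))
    print(euler_sample(oracle, x0, steps=10))
```

Under these assumptions, a prior sample already lies near the target whenever the semantic features explain most of the signal, so the velocity field only has to account for the remaining perceptual detail. This is one plausible reading of why such a prior could make flow matching more efficient.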