Large language models (LLMs) have attracted considerable attention for artificial intelligence generated content (AIGC), particularly with the emergence of ChatGPT. However, directly adapting continuous speech to LLMs, which process discrete tokens, remains an unsolved challenge and hinders the application of LLMs to speech generation. Advanced speech LMs are around the corner, since speech signals carry a wealth of information beyond text alone, including speaker identity and emotion. Prompt tuning has demonstrated notable parameter efficiency and competitive performance on some speech classification tasks. However, the extent to which prompts can effectively elicit generation tasks from speech LMs remains an open question. In this paper, we present pioneering research that explores the application of prompt tuning to stimulate speech LMs for various generation tasks within a unified framework called SpeechGen, which has around 10M trainable parameters. The proposed unified framework holds great promise for efficiency and effectiveness, particularly with the imminent arrival of advanced speech LMs, which will significantly enhance the capabilities of the framework. The code and demos of SpeechGen will be available on the project website: \url{https://ga642381.github.io/SpeechPrompt/speechgen}