Rapidly developing large language models (LLMs) have enabled a wide range of intelligent applications. In particular, GPT-4o's excellent duplex speech interaction capability has delivered an impressive experience to users. Researchers have recently proposed several multimodal LLMs in this direction that achieve user-agent speech-to-speech conversation. This paper proposes a novel speech-text multimodal LLM architecture called Freeze-Omni. Our main contribution is that the speech input and output modalities can be easily connected to a textual LLM while keeping the LLM's parameters frozen throughout the training process. We design a three-stage training strategy for modeling both speech input and output, enabling Freeze-Omni to acquire speech-to-speech conversation ability using text-speech paired data (such as ASR and TTS data) and only 60,000 multi-round text Q&A samples on 8 GPUs. Moreover, we effectively ensure that the intelligence of Freeze-Omni in the speech modality remains at the same level as that of its backbone LLM in the text modality, while achieving low-latency end-to-end spoken responses. In addition, we design a method to achieve duplex dialogue ability through multi-task training, giving Freeze-Omni a more natural style of dialogue between users and agents. In summary, Freeze-Omni holds great potential for speech-to-speech dialogue based on a multimodal LLM under the condition of a frozen LLM, avoiding the catastrophic forgetting problem caused by limited data and training resources.
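The core idea of keeping the backbone LLM frozen while training only the speech-side modules can be illustrated with a minimal sketch. This is not the paper's actual implementation: the `SpeechAdapter` module, its dimensions, and the stand-in transformer backbone are all hypothetical, chosen only to show how gradients reach the adapter while the backbone's weights stay fixed.

```python
# Minimal sketch of frozen-backbone training: only the speech adapter
# receives gradient updates; the (stand-in) LLM's weights never change.
# All module names and sizes here are illustrative assumptions.
import torch
import torch.nn as nn

class SpeechAdapter(nn.Module):
    """Hypothetical adapter projecting speech features into the LLM's embedding space."""
    def __init__(self, speech_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(speech_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

# Tiny transformer standing in for the textual LLM backbone.
llm = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2,
)
adapter = SpeechAdapter(speech_dim=80, llm_dim=64)

# Freeze the backbone: no parameter of the LLM is updated during training.
for p in llm.parameters():
    p.requires_grad = False

# The optimizer only ever sees the adapter's parameters.
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

speech_feats = torch.randn(2, 50, 80)   # (batch, frames, mel bins) — dummy input
out = llm(adapter(speech_feats))        # gradients flow *through* the frozen LLM
loss = out.pow(2).mean()                # placeholder loss for illustration
loss.backward()
optimizer.step()
```

Because the frozen layers still participate in the forward pass, the loss can back-propagate through them into the adapter, which is what lets the speech modules align with the text embedding space without risking catastrophic forgetting in the backbone.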