Rapidly developing large language models (LLMs) have enabled a wide range of intelligent applications, and GPT-4o's duplex speech interaction ability has recently provided users with an impressive experience. In this direction, researchers have proposed several multimodal LLMs capable of speech-to-speech dialogue. This paper proposes a novel speech-text multimodal LLM architecture called Freeze-Omni. Our main contribution is that the speech input and output modalities can be easily connected to a textual LLM while keeping the LLM's parameters frozen throughout the training process. We design a 3-stage training strategy for each of speech input modeling and speech output modeling, enabling Freeze-Omni to acquire speech-to-speech dialogue ability using text-speech paired data (such as ASR and TTS data) and only 60,000 multi-round text Q&A samples on 8 GPUs. Moreover, we effectively ensure that the intelligence of Freeze-Omni in the speech modality is at the same level as that of its backbone LLM in the text modality, while the end-to-end latency of the spoken response remains low. In addition, we design a method to achieve duplex dialogue ability through multi-task training, giving Freeze-Omni a more natural dialogue style when interacting with users. Overall, Freeze-Omni offers researchers a way to build multimodal LLMs on top of a frozen LLM, avoiding the catastrophic forgetting that limited data and training resources would otherwise cause.
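The frozen-backbone idea at the core of this contribution can be pictured with a minimal PyTorch sketch. The names below (`SpeechAdapter`, `build_trainable_params`, and the toy encoder standing in for the LLM) are illustrative assumptions, not the paper's actual modules: the point is only that the backbone's parameters have gradients disabled while a small speech-side module connected to its embedding space is trained.

```python
import torch
import torch.nn as nn


class SpeechAdapter(nn.Module):
    """Hypothetical adapter that maps acoustic features into the
    embedding space of a frozen textual LLM."""

    def __init__(self, feat_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(feat_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, speech_feats: torch.Tensor) -> torch.Tensor:
        # (batch, time, feat_dim) -> (batch, time, llm_dim)
        return self.proj(speech_feats)


def build_trainable_params(llm: nn.Module, adapter: nn.Module):
    # Freeze every parameter of the backbone LLM ...
    for p in llm.parameters():
        p.requires_grad = False
    # ... so the optimizer only ever updates the adapter.
    return [p for p in adapter.parameters() if p.requires_grad]


# Usage: a toy transformer stands in for the frozen LLM; only the
# adapter's parameters receive gradients during training.
llm = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=1024, nhead=8), num_layers=2
)
adapter = SpeechAdapter(feat_dim=80, llm_dim=1024)
optimizer = torch.optim.AdamW(build_trainable_params(llm, adapter), lr=1e-4)
```

Because the optimizer never sees the backbone's parameters, the LLM's text-modality knowledge is preserved by construction, which is how the design sidesteps catastrophic forgetting under limited data and compute.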