The rapid development of large language models has brought forth many new smart applications; in particular, the excellent multimodal human-computer interaction of GPT-4o has delivered an impressive experience to users. Against this background, researchers have recently proposed many multimodal LLMs capable of speech-to-speech dialogue. In this paper, we propose a speech-text multimodal LLM architecture called Freeze-Omni. Our main contribution is that the speech input and output modalities can be connected to the LLM while the LLM is kept frozen throughout the training process. We design a 3-stage training strategy for both speech input and speech output modeling, enabling Freeze-Omni to obtain speech-to-speech dialogue ability using text-speech paired data (such as ASR and TTS data) and only 60,000 multi-turn text Q&A samples on 8 GPUs. Moreover, we can effectively ensure that the intelligence of Freeze-Omni in the speech modality remains at the same level as that of its backbone LLM in the text modality, while keeping the end-to-end latency of the spoken response low. In addition, we design a method to achieve duplex dialogue ability through multi-task training, giving Freeze-Omni a more natural dialogue style with users. Freeze-Omni mainly offers researchers a way to build multimodal LLMs on top of a frozen LLM, avoiding the various impacts of catastrophic forgetting in the LLM caused by limited data and training resources.
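To make the core idea concrete, the following is a minimal sketch (not the authors' code) of connecting trainable speech-side modules to a frozen LLM backbone, so that only the adapters receive gradients during training. The module names (`SpeechEncoderAdapter`), feature dimensions, and the model identifier are illustrative assumptions, not details taken from the paper.

```python
# Sketch: train speech adapters against a frozen causal LLM backbone.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM


class SpeechEncoderAdapter(nn.Module):
    """Hypothetical trainable speech front-end mapping acoustic features
    into the frozen LLM's embedding space."""

    def __init__(self, speech_dim: int, llm_hidden: int):
        super().__init__()
        self.encoder = nn.GRU(speech_dim, llm_hidden, batch_first=True)
        self.adapter = nn.Linear(llm_hidden, llm_hidden)

    def forward(self, speech_feats: torch.Tensor) -> torch.Tensor:
        enc, _ = self.encoder(speech_feats)
        return self.adapter(enc)


class FrozenLLMWithSpeechInput(nn.Module):
    def __init__(self, llm_name: str, speech_dim: int = 1280):
        super().__init__()
        # Backbone LLM: loaded once and kept frozen for the whole training run.
        self.llm = AutoModelForCausalLM.from_pretrained(llm_name)
        for p in self.llm.parameters():
            p.requires_grad = False
        self.speech_front_end = SpeechEncoderAdapter(
            speech_dim, self.llm.config.hidden_size
        )

    def forward(self, speech_feats, labels=None):
        # Map speech features to "soft" input embeddings, then run the
        # frozen LLM on them; gradients flow only into the front-end.
        inputs_embeds = self.speech_front_end(speech_feats)
        return self.llm(inputs_embeds=inputs_embeds, labels=labels)


# Usage: only the speech-side parameters are handed to the optimizer,
# so the LLM weights never change. The checkpoint name is a placeholder.
model = FrozenLLMWithSpeechInput("your-backbone-llm")
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```

Since the backbone never receives gradient updates, its text-modality behavior (and thus its intelligence) is preserved by construction; the training stages described in the paper only shape the speech input and output paths around it.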