Multi-modal large language models are regarded as a crucial step towards Artificial General Intelligence (AGI) and have attracted significant interest since the emergence of ChatGPT. However, current speech-language models typically adopt a cascade paradigm, which prevents inter-modal knowledge transfer. In this paper, we propose SpeechGPT, a large language model with intrinsic cross-modal conversational abilities, capable of perceiving and generating multi-modal content. Using discrete speech representations, we first construct SpeechInstruct, a large-scale cross-modal speech instruction dataset. We then employ a three-stage training strategy consisting of modality-adaptation pre-training, cross-modal instruction fine-tuning, and chain-of-modality instruction fine-tuning. The experimental results demonstrate that SpeechGPT has a strong capacity to follow multi-modal human instructions and highlight the potential of handling multiple modalities with one model. Demos are available at https://0nutation.github.io/SpeechGPT.github.io/.
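To make the idea of discrete speech representations concrete, the following is a minimal sketch of how a sequence of discrete speech unit IDs might be serialized into tokens that share one vocabulary with text. It is an illustration under stated assumptions, not the authors' implementation: the marker tokens `<sosp>` and `<eosp>`, the `<N>` unit-token format, the repeat-collapsing step, and all function names are hypothetical.

```python
from typing import Dict, List

# Hypothetical marker tokens delimiting a speech span (assumed names).
SPEECH_START = "<sosp>"
SPEECH_END = "<eosp>"


def collapse_repeats(units: List[int]) -> List[int]:
    """Merge consecutive duplicate unit IDs (an optional, assumed preprocessing step)."""
    collapsed: List[int] = []
    for u in units:
        if not collapsed or collapsed[-1] != u:
            collapsed.append(u)
    return collapsed


def units_to_tokens(units: List[int]) -> List[str]:
    """Render unit IDs as string tokens wrapped in start/end markers."""
    body = [f"<{u}>" for u in collapse_repeats(units)]
    return [SPEECH_START] + body + [SPEECH_END]


def build_vocab(text_vocab: List[str], num_units: int) -> Dict[str, int]:
    """Extend a text vocabulary with the two markers and one token per speech unit."""
    vocab = {tok: i for i, tok in enumerate(text_vocab)}
    speech_tokens = [SPEECH_START, SPEECH_END] + [f"<{u}>" for u in range(num_units)]
    for tok in speech_tokens:
        # New tokens receive the next free ID; existing tokens keep theirs.
        vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Map tokens to integer IDs in the joint text-and-speech vocabulary."""
    return [vocab[tok] for tok in tokens]
```

With a joint vocabulary of this kind, speech and text can be interleaved in a single token sequence, which is the property that lets one model both read and emit either modality.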