Large language models (LLMs) have exhibited remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. Despite this recent success, current LLMs cannot process complex audio information or conduct spoken conversations (as Siri or Alexa do). In this work, we propose a multi-modal AI system named AudioGPT, which complements LLMs (i.e., ChatGPT) with 1) foundation models to process complex audio information and solve numerous understanding and generation tasks; and 2) an input/output interface (ASR, TTS) to support spoken dialogue. With an increasing demand to evaluate multi-modal LLMs on human intention understanding and cooperation with foundation models, we outline the principles and processes and test AudioGPT in terms of consistency, capability, and robustness. Experimental results demonstrate the capabilities of AudioGPT in solving AI tasks with speech, music, sound, and talking head understanding and generation in multi-round dialogues, which empowers humans to create rich and diverse audio content with unprecedented ease. Our system is publicly available at \url{https://github.com/AIGC-Audio/AudioGPT}.