Although Multimodal Large Language Models (MM-LLMs) have recently made exciting strides, most are limited to multimodal understanding on the input side and cannot produce content in multiple modalities. Because humans perceive the world and communicate with one another through various modalities, developing any-to-any MM-LLMs capable of accepting and delivering content in any modality is essential to human-level AI. To fill this gap, we present NExT-GPT, an end-to-end, general-purpose, any-to-any MM-LLM system. We connect an LLM with multimodal adaptors and different diffusion decoders, enabling NExT-GPT to perceive inputs and generate outputs in arbitrary combinations of text, images, videos, and audio. By leveraging existing well-trained, high-performing encoders and decoders, NExT-GPT is tuned by updating only a small fraction of parameters (1%) in certain projection layers, which not only keeps training costs low but also facilitates convenient expansion to further modalities. Moreover, we introduce modality-switching instruction tuning (MosIT) and manually curate a high-quality dataset for MosIT, with which NExT-GPT acquires complex cross-modal semantic understanding and content generation capabilities. Overall, our research showcases the promising possibility of building an AI agent capable of modeling universal modalities, paving the way for more human-like AI research in the community. Project page: https://next-gpt.github.io/
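The parameter-efficiency claim in the abstract (only about 1% of parameters, located in projection layers, are tuned) can be illustrated with a minimal sketch. It assumes that the encoders, the LLM, and the diffusion decoders are all kept frozen and that only the projection layers are updated. The component names and parameter counts are hypothetical placeholders chosen for illustration; they are not figures reported for NExT-GPT.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """One block of a hypothetical any-to-any pipeline."""
    name: str
    num_params: int
    trainable: bool


def trainable_fraction(components):
    """Return the share of all parameters that would be updated in training."""
    total = sum(c.num_params for c in components)
    if total == 0:
        raise ValueError("pipeline has no parameters")
    tuned = sum(c.num_params for c in components if c.trainable)
    return tuned / total


# Placeholder sizes for illustration only (assumptions, not from the paper).
PIPELINE = [
    Component("multimodal_encoder", 1_000_000_000, trainable=False),
    Component("input_projection", 50_000_000, trainable=True),
    Component("llm_backbone", 7_000_000_000, trainable=False),
    Component("output_projection", 50_000_000, trainable=True),
    Component("diffusion_decoders", 1_900_000_000, trainable=False),
]

if __name__ == "__main__":
    print(f"trainable share: {trainable_fraction(PIPELINE):.2%}")
```

Under these assumed sizes, the two projection layers account for 100 million of 10 billion parameters. Adding a further modality in such a setup would mean attaching another frozen encoder or decoder together with its own small trainable projection, which is why the trainable share can stay small.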