Although recent multimodal large language models handle a variety of modality tasks, they possess limited integration capabilities for complex multimodal tasks, which constrains the development of the field. In this work, we propose LLMBind, a unified framework for modality task integration that binds large language models to corresponding pre-trained task models through task-specific tokens. As a result, LLMBind can interpret inputs and produce outputs in versatile combinations of image, text, video, and audio. Specifically, we introduce a Mixture-of-Experts technique that enables effective learning across different multimodal tasks through collaboration among diverse experts. Furthermore, we create a multi-task dataset of 400k instruction examples, which unlocks interactive visual generation and editing. Extensive experiments show the effectiveness of our framework across various tasks, including image, video, and audio generation, image segmentation, and image editing. More encouragingly, our framework can be easily extended to other modality tasks, showcasing the potential of a unified AI agent for modeling universal modalities.
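To make the idea of binding a language model to task models through task-specific tokens more concrete, the following is a minimal sketch and not the LLMBind implementation. It assumes the language model emits a tagged span in its text output and that a dispatcher routes the enclosed prompt to a matching task model. The token names (`image_gen`, `audio_gen`), the tag syntax, and the stub generator functions are all hypothetical; the abstract does not specify them.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical tag syntax: <task_name> prompt </task_name>.
# The abstract does not specify the actual token vocabulary or format.
TASK_TOKEN = re.compile(
    r"<(?P<task>[a-z_]+)>(?P<prompt>.*?)</(?P=task)>", re.DOTALL
)


def fake_image_generator(prompt: str) -> str:
    """Stand-in for a pre-trained image generation model (assumption)."""
    return f"[image generated from: {prompt}]"


def fake_audio_generator(prompt: str) -> str:
    """Stand-in for a pre-trained audio generation model (assumption)."""
    return f"[audio generated from: {prompt}]"


# Registry mapping hypothetical task tokens to task models.
TASK_MODELS: Dict[str, Callable[[str], str]] = {
    "image_gen": fake_image_generator,
    "audio_gen": fake_audio_generator,
}


def dispatch(llm_output: str) -> List[Tuple[str, str]]:
    """Find tagged spans in the LLM output and call the matching task model."""
    results = []
    for match in TASK_TOKEN.finditer(llm_output):
        task = match.group("task")
        prompt = match.group("prompt").strip()
        model = TASK_MODELS.get(task)
        if model is None:
            # Unknown tags are skipped rather than treated as errors.
            continue
        results.append((task, model(prompt)))
    return results


example = "Sure, here is your picture. <image_gen> a red fox in snow </image_gen>"
print(dispatch(example))
```

Under these assumptions, supporting a new modality amounts to registering one more entry in `TASK_MODELS`, which mirrors the extensibility the abstract claims without saying anything about how the actual framework achieves it.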