Although recent multimodal large language models have made progress on various modality tasks, they possess limited integration capabilities for complex multi-modality tasks, which constrains the development of the field. In this work, we propose LLMBind, a unified framework for modality task integration that binds Large Language Models to corresponding pre-trained task models through task-specific tokens. As a result, LLMBind can interpret inputs and produce outputs in versatile combinations of image, text, video, and audio. Specifically, we introduce a Mixture-of-Experts technique that enables effective learning across different multimodal tasks through collaboration among diverse experts. Furthermore, we create a multi-task dataset of 400k instruction samples, which unlocks interactive visual generation and editing. Extensive experiments demonstrate the effectiveness of our framework across various tasks, including image, video, and audio generation, image segmentation, and image editing. More encouragingly, our framework can be easily extended to other modality tasks, showing the promising potential of a unified AI agent for modeling universal modalities.
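To make the idea of binding a language model to task models through task-specific tokens more concrete, the following is a minimal sketch of one way such a dispatch step could look. It is not the LLMBind implementation: the token names (`img_gen`, `aud_gen`), the `<task>...</task>` wrapping convention, the `dispatch` function, and the stand-in task models are all hypothetical and chosen only for illustration.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical convention: the LLM wraps a prompt for a task model in
# <task>...</task>. The actual token vocabulary is not given in the abstract.
TASK_PATTERN = re.compile(
    r"<(?P<task>[a-z_]+)>(?P<prompt>.*?)</(?P=task)>", re.DOTALL
)


def fake_image_generator(prompt: str) -> str:
    """Stand-in for a pre-trained image generation model."""
    return f"[image for: {prompt}]"


def fake_audio_generator(prompt: str) -> str:
    """Stand-in for a pre-trained audio generation model."""
    return f"[audio for: {prompt}]"


# Hypothetical registry mapping task tokens to task models.
TASK_MODELS: Dict[str, Callable[[str], str]] = {
    "img_gen": fake_image_generator,
    "aud_gen": fake_audio_generator,
}


def dispatch(llm_output: str) -> Tuple[str, List[str]]:
    """Route task-token spans to task models; return remaining text and results."""
    results: List[str] = []

    def handle(match: "re.Match[str]") -> str:
        model = TASK_MODELS.get(match.group("task"))
        if model is None:
            # Unknown token: leave the span in the text untouched.
            return match.group(0)
        results.append(model(match.group("prompt").strip()))
        return ""

    text = TASK_PATTERN.sub(handle, llm_output)
    # Collapse whitespace left behind by removed spans.
    return " ".join(text.split()), results


if __name__ == "__main__":
    reply = "Here is your picture. <img_gen>a red fox in snow</img_gen>"
    print(dispatch(reply))
```

Under this assumed design, supporting a new modality task amounts to registering one more token and task model, which is one plausible reading of why such a framework could be extended easily.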