Large language models with instruction-following abilities have revolutionized the field of artificial intelligence. Through their natural language interfaces, these models generalize remarkably well across a wide range of real-world tasks. However, their performance relies heavily on high-quality exemplar data, which is often difficult to obtain. This challenge is even more acute for multimodal instruction following. We introduce TextBind, an almost annotation-free framework for empowering large language models with multi-turn interleaved multimodal instruction-following capabilities. Our approach requires only image-caption pairs and uses a language model to generate multi-turn multimodal instruction-response conversations from them. To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models. We release our dataset, model, and demo to foster future research on multimodal instruction following.