Large language models with instruction-following abilities have revolutionized the field of artificial intelligence. These models generalize exceptionally well across real-world tasks through their natural language interfaces. However, their performance relies heavily on high-quality exemplar data, which is often difficult to obtain. This challenge is further exacerbated in multimodal instruction following. We introduce TextBind, an almost annotation-free framework for empowering large language models with multi-turn interleaved multimodal instruction-following capabilities. Our approach requires only image-caption pairs and uses a language model to generate multi-turn multimodal instruction-response conversations from them. To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models. We release our dataset, model, and demo to foster future research in the area of multimodal instruction following.