Large language models with instruction-following abilities have revolutionized the field of artificial intelligence. These models generalize exceptionally well across real-world tasks through their natural language interfaces. However, their performance relies heavily on high-quality exemplar data, which is often difficult to obtain. This challenge is exacerbated further in multimodal instruction following. We introduce TextBind, an almost annotation-free framework for empowering large language models with multi-turn interleaved multimodal instruction-following capabilities. Our approach requires only image-caption pairs and uses a language model to generate multi-turn multimodal instruction-response conversations. To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models. We release our dataset, model, and demo to foster future research in the area of multimodal instruction following.