Large language models with instruction-following abilities have revolutionized the field of artificial intelligence. These models generalize exceptionally well, tackling a wide range of real-world tasks through their natural language interfaces. However, their performance relies heavily on high-quality exemplar data, which is often difficult to obtain. This challenge is further exacerbated in multimodal instruction following. We introduce TextBind, an almost annotation-free framework for empowering large language models with multi-turn interleaved multimodal instruction-following capabilities. Our approach requires only image-caption pairs and uses a language model to generate multi-turn multimodal instruction-response conversations from them. To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models. We release our dataset, model, and demo to foster future research in the area of multimodal instruction following.