High-quality instructions and responses are essential for the zero-shot performance of large language models on interactive natural language tasks. For interactive vision-language tasks involving intricate visual scenes, a large quantity of diverse and creative instruction-response pairs is needed to tune vision-language models (VLMs). Nevertheless, the vision-language instruction-response pairs currently available remain limited in quantity, diversity, and creativity, which poses challenges to the generalization of interactive VLMs. Here we present MultI-Modal In-Context Instruction Tuning (MIMIC-IT), a dataset comprising 2.8 million multimodal instruction-response pairs, with 2.2 million unique instructions derived from images and videos. Each pair is accompanied by multimodal in-context information, forming conversational contexts aimed at empowering VLMs in perception, reasoning, and planning. The instruction-response collection process, dubbed Syphus, is scaled using an automatic annotation pipeline that combines human expertise with GPT's capabilities. Using the MIMIC-IT dataset, we train a large VLM named Otter. Extensive evaluations on vision-language benchmarks show that Otter demonstrates remarkable proficiency in multimodal perception, reasoning, and in-context learning. Human evaluation indicates that it aligns effectively with users' intentions. We release the MIMIC-IT dataset, the instruction-response collection pipeline, benchmarks, and the Otter model.