This paper introduces MM-Instruct, a large-scale dataset of diverse, high-quality visual instruction data designed to enhance the instruction-following capabilities of large multimodal models (LMMs). While existing visual instruction datasets often focus on question answering, they struggle to generalize to broader application scenarios such as creative writing, summarization, or image analysis. To address these limitations, we propose a novel approach to constructing MM-Instruct that leverages the strong instruction-following capabilities of existing LLMs to generate novel visual instruction data from large-scale but conventional image captioning datasets. MM-Instruct first uses ChatGPT to automatically generate diverse instructions from a small set of seed instructions through augmentation and summarization. It then matches these instructions with images and uses an open-source large language model (LLM) to generate coherent answers for the resulting instruction-image pairs. Throughout answer generation, the LLM is grounded in detailed text descriptions of the images to ensure that the generated instruction data remains aligned with the visual content. Moreover, we introduce a benchmark based on the generated instruction data to evaluate the instruction-following capabilities of existing LMMs. We demonstrate the effectiveness of MM-Instruct by training a LLaVA-1.5 model on the generated data, denoted LLaVA-Instruct, which exhibits significant improvements in instruction-following capability over the original LLaVA-1.5 model. The MM-Instruct dataset, benchmark, and pre-trained models are available at https://github.com/jihaonew/MM-Instruct.
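The three-stage pipeline described above can be sketched in miniature. This is an illustrative outline only, not the paper's implementation: the function names are hypothetical, and the stubs below stand in for calls to ChatGPT (instruction augmentation) and an open-source LLM (caption-grounded answer generation).

```python
# Hypothetical sketch of the MM-Instruct construction pipeline.
# Stage functions are stand-ins for the real ChatGPT / open-source LLM calls.

def augment_instructions(seeds, n_variants=2):
    """Stand-in for ChatGPT-based augmentation and summarization:
    expand a small seed set into a larger, more diverse instruction pool."""
    pool = []
    for seed in seeds:
        pool.append(seed)
        for i in range(n_variants):
            pool.append(f"{seed} (variant {i + 1})")  # placeholder rewrite
    return pool

def match_instruction(instruction, captioned_images):
    """Stand-in for instruction-image matching: here, pick the image whose
    caption shares the most words with the instruction."""
    words = set(instruction.lower().split())
    return max(captioned_images,
               key=lambda item: len(words & set(item["caption"].lower().split())))

def generate_answer(instruction, caption):
    """Stand-in for the open-source LLM: the answer is grounded in the
    image's detailed text description (its caption), not the raw pixels."""
    return f"Answer to '{instruction}', grounded in: {caption}"

# Toy inputs (illustrative, not from the dataset).
seeds = ["Write a short story about this image"]
images = [{"caption": "a dog playing in the park"},
          {"caption": "a city skyline at night"}]

data = []
for instr in augment_instructions(seeds):
    img = match_instruction(instr, images)
    data.append({"instruction": instr,
                 "answer": generate_answer(instr, img["caption"])})

print(len(data))  # 3: one seed expanded to two extra variants
```

The key design point the sketch preserves is that grounding happens in text space: the answer generator only ever sees the image's detailed caption, which is what keeps generated answers consistent with the image content.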