This paper introduces MM-Instruct, a large-scale dataset of diverse, high-quality visual instruction data designed to enhance the instruction-following capabilities of large multimodal models (LMMs). While existing visual instruction datasets often focus on question answering, they struggle to generalize to broader application scenarios such as creative writing, summarization, or image analysis. To address these limitations, we propose a novel approach to constructing MM-Instruct that leverages the strong instruction-following capabilities of existing LLMs to generate novel visual instruction data from large-scale but conventional image captioning datasets. MM-Instruct first uses ChatGPT to automatically generate diverse instructions from a small set of seed instructions through augmentation and summarization. It then matches these instructions with images and uses an open-source large language model (LLM) to generate coherent answers for the resulting instruction-image pairs. Throughout answer generation, the LLM is grounded in detailed text descriptions of the images to ensure that the generated instruction data remains aligned with the visual content. Moreover, we introduce a benchmark based on the generated instruction data to evaluate the instruction-following capabilities of existing LMMs. We demonstrate the effectiveness of MM-Instruct by training a LLaVA-1.5 model on the generated data, denoted LLaVA-Instruct, which exhibits significant improvements in instruction-following capability over the original LLaVA-1.5 model. The MM-Instruct dataset, benchmark, and pre-trained models are available at https://github.com/jihaonew/MM-Instruct.
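The three-stage pipeline described above can be sketched in miniature. This is an illustrative outline only, not the paper's implementation: the function names are hypothetical, and the stubs below stand in for calls to ChatGPT (instruction augmentation) and an open-source LLM (caption-grounded answer generation).

```python
# Hypothetical sketch of the MM-Instruct construction pipeline.
# Stage functions are stand-ins for the real ChatGPT / open-source LLM calls.

def augment_instructions(seeds, n_variants=2):
    """Stand-in for ChatGPT-based augmentation and summarization:
    expand a small seed set into a larger, more diverse instruction pool."""
    pool = []
    for seed in seeds:
        pool.append(seed)
        for i in range(n_variants):
            pool.append(f"{seed} (variant {i + 1})")  # placeholder rewrite
    return pool

def match_instruction(instruction, captioned_images):
    """Stand-in for instruction-image matching: here, pick the image whose
    caption shares the most words with the instruction."""
    words = set(instruction.lower().split())
    return max(captioned_images,
               key=lambda item: len(words & set(item["caption"].lower().split())))

def generate_answer(instruction, caption):
    """Stand-in for the open-source LLM: the answer is grounded in the
    image's detailed text description (its caption), not the raw pixels."""
    return f"Answer to '{instruction}', grounded in: {caption}"

# Toy inputs (illustrative, not from the dataset).
seeds = ["Write a short story about this image"]
images = [{"caption": "a dog playing in the park"},
          {"caption": "a city skyline at night"}]

data = []
for instr in augment_instructions(seeds):
    img = match_instruction(instr, images)
    data.append({"instruction": instr,
                 "answer": generate_answer(instr, img["caption"])})

print(len(data))  # 3: one seed expanded to two extra variants
```

The key design point the sketch preserves is that grounding happens in text space: the answer generator only ever sees the image's detailed caption, which is what keeps generated answers consistent with the image content.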