Vision-language supervised fine-tuning is effective at enhancing the performance of Vision Large Language Models (VLLMs). However, existing visual instruction tuning datasets have the following limitations: (1) Instruction annotation quality: although existing VLLMs exhibit strong performance, instructions generated by these advanced VLLMs may still contain inaccuracies, such as hallucinations. (2) Instruction and image diversity: the limited range of instruction types and the lack of diversity in image data may limit the model's ability to generate diverse outputs that are closer to real-world scenarios. To address these challenges, we construct MMInstruct, a high-quality, diverse visual instruction tuning dataset consisting of 973K instructions from 24 domains. It covers four instruction types: Judgement, Multiple-Choice, Long Visual Question Answering, and Short Visual Question Answering. To construct MMInstruct, we propose an instruction generation data engine that leverages GPT-4V, GPT-3.5, and manual correction. Our instruction generation engine enables semi-automatic, low-cost, and multi-domain instruction generation at 1/6 the cost of manual construction. Through extensive experimental validation and ablation studies, we demonstrate that MMInstruct significantly improves the performance of VLLMs, e.g., a model fine-tuned on MMInstruct achieves new state-of-the-art performance on 10 out of 12 benchmarks. The code and data will be available at https://github.com/yuecao0119/MMInstruct.