The integration of visual encoders and large language models (LLMs) has driven recent progress in multimodal large language models (MLLMs). However, high-quality instruction-tuning data for vision-language tasks remains scarce. The current leading paradigm, exemplified by LLaVA, relies on language-only GPT-4 to generate data; this requires pre-annotated image captions and detection bounding boxes and, because the generator never sees the image itself, struggles to capture image details. A practical alternative is to use existing MLLMs to generate instruction data for vision-language tasks. However, currently accessible MLLMs are less capable than their LLM counterparts: they tend to produce inadequate responses and generate false information. To address this, this paper proposes the Visual Instruction Generation and Correction (VIGC) framework, which enables MLLMs to generate instruction-tuning data and progressively improve its quality on the fly. Specifically, Visual Instruction Generation (VIG) guides the vision-language model to generate diverse instruction-tuning data. To ensure generation quality, Visual Instruction Correction (VIC) adopts an iterative update mechanism to correct inaccuracies in the data produced by VIG, effectively reducing the risk of hallucination. Using the diverse, high-quality data generated by VIGC, we fine-tune mainstream models and validate data quality through a range of evaluations. Experimental results demonstrate that VIGC not only compensates for the shortcomings of language-only data generation methods but also improves benchmark performance. The models, datasets, and code are available at https://opendatalab.github.io/VIGC.
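The generate-then-correct idea described above can be illustrated with a minimal sketch. This is not the VIGC implementation: the `Sample` type, the `generate` and `correct` callables, the `max_rounds` limit, and the stop-when-unchanged rule are all hypothetical stand-ins chosen for illustration, and the toy functions replace what would be calls to a vision-language model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    """One instruction-tuning example (hypothetical structure)."""
    image_id: str
    question: str
    answer: str


def generate_and_correct(
    image_ids: List[str],
    generate: Callable[[str], Sample],   # assumed: image id -> initial Q/A pair
    correct: Callable[[Sample], str],    # assumed: sample -> revised answer
    max_rounds: int = 3,                 # assumed iteration cap
) -> List[Sample]:
    """Generate a sample per image, then iteratively revise its answer."""
    results = []
    for image_id in image_ids:
        sample = generate(image_id)      # generation step
        for _ in range(max_rounds):      # correction step, repeated
            revised = correct(sample)
            if revised == sample.answer:
                break                    # answer is stable; stop revising
            sample = Sample(sample.image_id, sample.question, revised)
        results.append(sample)
    return results


# Toy stand-ins for model calls, for demonstration only.
def toy_generate(image_id: str) -> Sample:
    return Sample(image_id, "What is in the image?",
                  "a dog and a frisbee and a unicorn")


def toy_correct(sample: Sample) -> str:
    # Pretend the corrector removes an unsupported (hallucinated) object.
    return sample.answer.replace(" and a unicorn", "")
```

Calling `generate_and_correct(["img1"], toy_generate, toy_correct)` runs one generation pass and then revises the answer until it stops changing or the round limit is reached. Separating the two callables keeps the generation and correction roles independently replaceable.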