Instruction tuning data is essential for training Multimodal Large Language Models (MLLMs). However, creating high-quality instruction tuning data remains a significant challenge. Prior methods that relied on GPT-4 for data generation were not only costly but also performed unsatisfactorily on complex tasks (e.g., grounding-based reasoning tasks). To address these issues, we develop Genixer, an innovative data generation pipeline that produces diverse, high-quality instruction tuning data covering nine representative tasks, e.g., Common VQA, REC, REG, and PointQ. Specifically, Genixer provides a unified solution with four key steps that ease data generation: (i) instruction data collection, (ii) instruction template design, (iii) empowering the MLLM, and (iv) data generation and filtering. The superior qualitative results of Genixer demonstrate that current MLLMs have strong potential to evolve into powerful data generators. Additionally, to quantitatively validate the efficacy of the generated data, we add the instruction tuning data produced by Genixer to the training of two representative MLLMs and observe consistent improvements on various VQA tasks and multimodal benchmarks.
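The four-step pipeline above can be sketched as a minimal skeleton. This is an illustrative outline only, not Genixer's actual implementation: the `Sample` schema, the stand-in generator, and the score-threshold filter are all hypothetical placeholders for the tuned MLLM and quality filter the paper describes.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical data record; the real Genixer schema is not specified here.
@dataclass
class Sample:
    task: str       # e.g. "Common VQA", "REC", "REG", "PointQ"
    question: str
    answer: str

# (i) Instruction data collection: gather seed examples for one task type.
def collect(seed_pool: List[Sample], task: str) -> List[Sample]:
    return [s for s in seed_pool if s.task == task]

# (ii) Instruction template design: wrap a sample in a task-specific prompt.
def apply_template(sample: Sample) -> str:
    return f"[{sample.task}] Question: {sample.question}"

# (iii)+(iv) Empowering the MLLM, then generation and filtering: a toy loop.
# In practice `model_fn` would be a fine-tuned MLLM conditioned on images,
# and `score_fn` a learned or rule-based quality filter.
def generate_and_filter(
    prompts: List[str],
    model_fn: Callable[[str], str],
    score_fn: Callable[[str, str], float],
    threshold: float = 0.5,
) -> List[Tuple[str, str, float]]:
    kept = []
    for prompt in prompts:
        answer = model_fn(prompt)
        score = score_fn(prompt, answer)
        if score >= threshold:       # keep only high-quality generations
            kept.append((prompt, answer, score))
    return kept
```

A toy run would collect REC seeds, template them, and pass them through the generate-and-filter loop with stub model and scorer functions.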