MLLM-DataEngine: An Iterative Refinement Approach for MLLM

Despite the great advances in Multimodal Large Language Models (MLLMs) in both instruction dataset building and benchmarking, training and evaluation remain independent of each other, which makes it hard for current MLLMs to improve their capabilities under the guidance of evaluation results at a relatively low human cost. In this paper, we propose MLLM-DataEngine, a novel closed-loop system that bridges data generation, model training, and evaluation. Within each loop iteration, MLLM-DataEngine first analyzes the weaknesses of the model based on the evaluation results, then generates a proper incremental dataset for the next training iteration, enhancing the model's capability iteratively. Compared with previous data collection methods, which are separate from benchmarking, the data generated by MLLM-DataEngine shows better targeting, quality, and correctness. For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data within each incremental dataset based on the benchmarking results. For quality, we use GPT-4 to generate high-quality data for each given data type. For correctness, prompt design is critical to the data generation results. Rather than relying on hand-crafted prompts as in previous work, we propose an Interactive Prompt Optimization strategy, which optimizes the prompt through multi-round interaction between humans and GPT and greatly improves the correctness of the generated data. Through extensive experiments, we find that MLLM-DataEngine can boost MLLM capability in a targeted and automatic manner, with only a little human participation. We hope it can serve as a general solution for building future MLLMs. MLLM-DataEngine has been open-sourced and is available at https://github.com/opendatalab/MLLM-DataEngine.
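
To make the idea of adjusting per-type data ratios from benchmarking results concrete, here is a minimal sketch. The abstract does not state the actual sampling rule, so the error-proportional weighting below is an assumption, and the function name `allocate_budget`, its parameters, and the example data types are hypothetical rather than taken from the MLLM-DataEngine code base.

```python
from typing import Dict


def allocate_budget(accuracy: Dict[str, float], budget: int) -> Dict[str, int]:
    """Split `budget` new samples across data types in proportion to error rate.

    Assumption (not from the paper): a type's weight is (1 - accuracy),
    normalized over all types. Weaker types therefore receive more data.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    if not accuracy:
        return {}

    errors = {name: 1.0 - acc for name, acc in accuracy.items()}
    total_error = sum(errors.values())

    if total_error == 0:
        # No bad cases at all: fall back to a uniform split.
        weights = {name: 1.0 / len(errors) for name in errors}
    else:
        weights = {name: err / total_error for name, err in errors.items()}

    # Largest-remainder rounding so the integer counts sum to `budget`.
    raw = {name: w * budget for name, w in weights.items()}
    counts = {name: int(value) for name, value in raw.items()}
    leftover = max(budget - sum(counts.values()), 0)
    by_remainder = sorted(raw, key=lambda n: raw[n] - counts[n], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


# Hypothetical per-type accuracies from one evaluation round.
round_accuracy = {"counting": 0.5, "ocr": 0.75, "spatial": 0.75}
print(allocate_budget(round_accuracy, budget=100))
# -> {'counting': 50, 'ocr': 25, 'spatial': 25}
```

In a closed loop of the kind described above, a function like this would be called once per iteration on fresh evaluation results, so the composition of each incremental dataset shifts as the model's weak spots change.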
