Multimodal instruction tuning is an emerging line of research, and a number of benchmarks have recently been proposed for evaluating the resulting models. Instead of evaluating the models directly, in this paper we evaluate the Vision-Language Instruction-Tuning (VLIT) datasets themselves. We also investigate how to build a dataset for developing an all-powerful VLIT model, which we believe could also help establish a grounded protocol for benchmarking VLIT models. Because how to evaluate VLIT datasets effectively remains an open question, we propose a tune-cross-evaluation paradigm: tuning on one dataset and evaluating on each of the others in turn. For each single tune-evaluation experiment set, we define the Meta Quality (MQ) as the mean score over a set of caption metrics, including BLEU, METEOR, and ROUGE-L, to quantify the quality of a given dataset or sample. On this basis, to evaluate the comprehensiveness of a dataset, we develop the Dataset Quality (DQ), which covers all tune-evaluation sets. To lay the foundation for building a comprehensive dataset and developing an all-powerful model for practical applications, we define the Sample Quality (SQ) to quantify the all-sided quality of each sample. Extensive experiments validate the rationality of the proposed evaluation paradigm. Based on this holistic evaluation, we build a new dataset, REVO-LION (REfining VisiOn-Language InstructiOn tuNing), by collecting the samples with higher SQ from each dataset. Remarkably, even with only half of the complete data, a model trained on REVO-LION achieves performance comparable to that of a model trained on the simple union of all VLIT datasets. Furthermore, REVO-LION not only facilitates the development of a powerful model but also incorporates an evaluation set, designed to serve as a convenient benchmark for future research in the field.
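To make the tune-cross-evaluation paradigm concrete, the following is a minimal sketch rather than the paper's implementation. MQ follows the definition above (the mean of caption-metric scores for one tune-evaluation pair). The aggregation used for DQ here, an unweighted mean of MQ over the held-out datasets, is an assumption, since the abstract says only that DQ covers all tune-evaluation sets. SQ is omitted because the abstract does not state how it is computed. The names `tune_and_score` and `toy_tune_and_score`, the dataset labels `A`, `B`, `C`, and all numeric scores are hypothetical placeholders.

```python
from statistics import mean
from typing import Callable, Dict, Mapping, Sequence

# Hypothetical interface: tune a model on `tuned_on`, evaluate it on
# `evaluated_on`, and return caption-metric scores keyed by metric name.
ScoreFn = Callable[[str, str], Mapping[str, float]]


def meta_quality(metric_scores: Mapping[str, float]) -> float:
    """MQ for one tune-evaluation pair: mean over the caption metrics."""
    return mean(metric_scores.values())


def tune_cross_evaluate(
    datasets: Sequence[str], tune_and_score: ScoreFn
) -> Dict[str, Dict[str, float]]:
    """Return mq[tuned_on][evaluated_on] for every ordered pair of distinct datasets."""
    return {
        tuned_on: {
            evaluated_on: meta_quality(tune_and_score(tuned_on, evaluated_on))
            for evaluated_on in datasets
            if evaluated_on != tuned_on  # evaluate only on the *other* datasets
        }
        for tuned_on in datasets
    }


def dataset_quality(mq_row: Mapping[str, float]) -> float:
    """ASSUMED aggregation: unweighted mean of MQ over the held-out datasets."""
    return mean(mq_row.values())


def toy_tune_and_score(tuned_on: str, evaluated_on: str) -> Mapping[str, float]:
    """Hypothetical stand-in for real tuning and evaluation; the scores are made up."""
    base = {"A": 0.1, "B": 0.2, "C": 0.3}[evaluated_on]
    return {"BLEU": base, "METEOR": 2 * base, "ROUGE-L": 3 * base}


mq = tune_cross_evaluate(["A", "B", "C"], toy_tune_and_score)
dq = {name: dataset_quality(row) for name, row in mq.items()}
```

In practice, `toy_tune_and_score` would be replaced by a routine that actually tunes a model on one dataset and computes BLEU, METEOR, and ROUGE-L on another; the resulting `mq` table has one row per tuning dataset and one column per held-out evaluation dataset.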