Vision-Language Instruction Tuning (VLIT) is a critical training phase for Large Vision-Language Models (LVLMs). As open-source LVLMs have grown more capable, researchers have increasingly used them to generate VLIT data and have achieved significant progress. However, such data generation approaches face two challenges: 1) Because multi-modal models tend to be influenced by prior language knowledge, directly using LVLMs to generate VLIT data inevitably leads to low content relevance between the generated data and the images. 2) To improve the models' ability to generate VLIT data, previous methods have added an extra training phase to boost generative capacity, which hurts the models' generalization to unseen inputs (i.e., the "exposure bias" problem). In this paper, we propose Content Correlated VLIT data generation via Contrastive Learning (C3L). Specifically, we design a content relevance module that enhances the content relevance between VLIT data and images by computing Image Instruction Correspondence Scores S(I2C). Moreover, we introduce a contrastive learning module to further boost the VLIT data generation capability of the LVLMs. Extensive automatic evaluation on four benchmarks demonstrates the effectiveness of our method.