General-purpose language models that can solve a wide range of language-domain tasks have emerged, driven by the pre-training and instruction-tuning pipeline. However, building general-purpose vision-language models is challenging because the additional visual input increases the discrepancy between tasks. Although vision-language pre-training has been widely studied, vision-language instruction tuning remains comparatively underexplored. In this paper, we conduct a systematic and comprehensive study of vision-language instruction tuning based on the pre-trained BLIP-2 models. We gather a diverse set of 26 publicly available datasets, transform them into instruction-tuning format, and categorize them into two clusters: one for held-in instruction tuning and one for held-out zero-shot evaluation. Additionally, we introduce instruction-aware visual feature extraction, a crucial method that enables the model to extract informative features tailored to the given instruction. The resulting InstructBLIP models achieve state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and the larger Flamingo. Our models also achieve state-of-the-art performance when fine-tuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA IMG). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models. All InstructBLIP models have been open-sourced at https://github.com/salesforce/LAVIS/tree/main/projects/instructblip.
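To make the data-preparation step concrete, the following is a minimal sketch of how a single visual question answering record might be rewritten into instruction-tuning format and tagged as held-in or held-out. It is an illustration under stated assumptions, not the authors' pipeline: the templates in `VQA_TEMPLATES`, the dataset names `dataset_a` and `dataset_b`, the record fields, and the helper `to_instruction_sample` are all hypothetical and do not come from the paper or the LAVIS code base.

```python
import random
from dataclasses import dataclass

# Hypothetical instruction templates for a VQA-style task.
# These are illustrative and are not the templates used in the paper.
VQA_TEMPLATES = [
    "{question}",
    "Question: {question} Short answer:",
    "Given the image, answer the following question briefly. {question}",
]

# Placeholder dataset names; which real datasets are held out is not stated here.
HELD_OUT_DATASETS = {"dataset_b"}


@dataclass
class InstructionSample:
    image_path: str
    instruction: str
    target: str
    dataset: str
    split: str  # "held-in" (used for tuning) or "held-out" (zero-shot evaluation)


def to_instruction_sample(record: dict, dataset: str, rng: random.Random) -> InstructionSample:
    """Wrap one raw record in a randomly chosen instruction template."""
    template = rng.choice(VQA_TEMPLATES)
    instruction = template.format(question=record["question"])
    split = "held-out" if dataset in HELD_OUT_DATASETS else "held-in"
    return InstructionSample(
        image_path=record["image"],
        instruction=instruction,
        target=record["answer"],
        dataset=dataset,
        split=split,
    )


rng = random.Random(0)  # fixed seed so the template choice is reproducible
record = {"image": "img_001.jpg", "question": "What color is the bus?", "answer": "red"}
sample = to_instruction_sample(record, "dataset_a", rng)
print(sample.split, "|", sample.instruction, "->", sample.target)
```

Sampling a template per record, rather than fixing one, is one plausible way to expose the model to varied phrasings of the same task; held-out datasets would be formatted the same way but excluded from tuning.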