InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

General-purpose language models that can solve various language-domain tasks have emerged driven by the pre-training and instruction-tuning pipeline. However, building general-purpose vision-language models is challenging due to the increased task discrepancy introduced by the additional visual input. Although vision-language pre-training has been widely studied, vision-language instruction tuning remains relatively less explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pre-trained BLIP-2 models. We gather a wide variety of 26 publicly available datasets, transform them into instruction tuning format and categorize them into two clusters for held-in instruction tuning and held-out zero-shot evaluation. Additionally, we introduce instruction-aware visual feature extraction, a crucial method that enables the model to extract informative features tailored to the given instruction. The resulting InstructBLIP models achieve state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and the larger Flamingo. Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA IMG). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models. All InstructBLIP models have been open-sourced at https://github.com/salesforce/LAVIS/tree/main/projects/instructblip.

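The abstract describes transforming existing datasets into an instruction tuning format and dividing them into held-in and held-out clusters. A minimal sketch of that idea follows. It is an illustration only, not the authors' pipeline: the template strings, the field names (`image`, `question`, `answer`, `text_input`, `text_output`), the dataset names, and both helper functions are hypothetical and chosen for readability.

```python
import random

# Hypothetical instruction templates for a VQA-style dataset.
# The templates actually used by InstructBLIP may differ.
VQA_TEMPLATES = [
    "Question: {question} Short answer:",
    "Answer the following question about the image. {question}",
]


def to_instruction_sample(sample, templates, rng):
    """Wrap one raw (image, question, answer) record in a sampled template."""
    template = rng.choice(templates)
    return {
        "image": sample["image"],
        "text_input": template.format(question=sample["question"]),
        "text_output": sample["answer"],
    }


def split_datasets(names, held_out):
    """Partition dataset names into held-in (training) and held-out (zero-shot) lists."""
    held_out = set(held_out)
    held_in = [name for name in names if name not in held_out]
    held_out_list = [name for name in names if name in held_out]
    return held_in, held_out_list


if __name__ == "__main__":
    rng = random.Random(0)
    raw = {"image": "img_001.jpg", "question": "What color is the car?", "answer": "red"}
    print(to_instruction_sample(raw, VQA_TEMPLATES, rng))

    # Placeholder dataset names, not the 26 datasets from the paper.
    print(split_datasets(["dataset_a", "dataset_b", "dataset_c"], held_out=["dataset_c"]))
```

Sampling a template per example, rather than fixing one, is a common way to keep a model from overfitting to a single phrasing; whether and how the paper does this is not stated in the abstract.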
