Instruction tuning, a new learning paradigm that fine-tunes pre-trained language models on tasks specified through instructions, has shown promising zero-shot performance on various natural language processing tasks. However, it has yet to be explored for vision and multimodal tasks. In this work, we introduce MULTIINSTRUCT, the first multimodal instruction tuning benchmark dataset, which consists of 62 diverse multimodal tasks in a unified sequence-to-sequence format covering 10 broad categories. The tasks are derived from 21 existing open-source datasets, and each task is equipped with 5 expert-written instructions. We take OFA as the base pre-trained model for multimodal instruction tuning, and to further improve its zero-shot performance, we explore multiple transfer learning strategies to leverage the large-scale NATURAL INSTRUCTIONS dataset. Experimental results demonstrate strong zero-shot performance on various unseen multimodal tasks and the benefit of transfer learning from a text-only instruction dataset. We also design a new evaluation metric, Sensitivity, to evaluate how sensitive the model is to the variety of instructions. Our results indicate that fine-tuning the model on a diverse set of tasks and instructions leads to reduced sensitivity to variations in the instructions for each task.
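The abstract names the Sensitivity metric but does not define it. As a minimal sketch only, one plausible reading is the spread of a model's score across the instructions for a task, normalized by the mean score and then averaged over tasks. The function name `sensitivity`, the input layout, and the choice of coefficient of variation are assumptions made for illustration, not the definition confirmed by this text.

```python
from statistics import mean, pstdev


def sensitivity(scores_by_task):
    """Hypothetical sensitivity score (assumed formulation, not from the abstract).

    scores_by_task maps a task name to the list of scores the model obtained
    under each instruction variant for that task, e.g. 5 scores for the 5
    instructions of one task.
    """
    ratios = []
    for task, scores in scores_by_task.items():
        mu = mean(scores)
        if mu == 0:
            raise ValueError(f"mean score for task {task!r} is zero")
        # Population standard deviation across instructions, normalized by
        # the mean so that tasks with different score scales are comparable.
        ratios.append(pstdev(scores) / mu)
    # Average the per-task ratios; lower means less sensitive to wording.
    return mean(ratios)


if __name__ == "__main__":
    example = {
        "task_a": [1.0, 3.0],  # scores vary with the instruction wording
        "task_b": [2.0, 2.0],  # identical score under every instruction
    }
    print(sensitivity(example))
```

Under this reading, a model whose score is the same for every instruction of a task contributes zero for that task, which matches the abstract's claim that lower sensitivity is the desirable outcome.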