Visual instruction tuning of large language models (LLMs) on image-text pairs has achieved general-purpose vision-language abilities. However, the lack of region-text pairs limits progress toward fine-grained multimodal understanding. In this paper, we propose spatial instruction tuning, which introduces a reference to a region of interest (RoI) in the instruction. Before the instruction is sent to the LLM, each reference is replaced by RoI features and interleaved with the language embeddings as a single sequence. Our model, GPT4RoI, trained on 7 region-text pair datasets, brings an unprecedented interactive and conversational experience compared to previous image-level models. (1) Interaction beyond language: users can interact with our model through both language and drawn bounding boxes, flexibly adjusting the referring granularity. (2) Versatile multimodal abilities: GPT4RoI can mine a variety of attribute information within each RoI, e.g., color, shape, material, and action. Furthermore, it can reason about multiple RoIs based on common sense. On the Visual Commonsense Reasoning (VCR) dataset, GPT4RoI achieves an accuracy of 81.6%, surpassing all existing models by a significant margin (the second-best result is 75.6%) and approaching the human-level performance of 85.0%. The code, dataset, and demo can be found at https://github.com/jshilong/GPT4RoI.
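The replace-and-interleave step described above can be illustrated with a minimal sketch. It is not the GPT4RoI implementation: the `<regionN>` placeholder format, the toy embedding function, the embedding width, and the hand-written RoI feature vectors are all assumptions made for illustration. In a real system the token embeddings would come from the LLM's embedding table and the RoI features from a visual region feature extractor.

```python
import re
from typing import Dict, List, Sequence

EMBED_DIM = 4  # toy embedding width (assumption, far smaller than a real LLM)

# Hypothetical placeholder format for a region reference inside the instruction.
REGION_PATTERN = re.compile(r"^<region(\d+)>$")


def toy_token_embedding(token: str, dim: int = EMBED_DIM) -> List[float]:
    """Stand-in for an LLM embedding lookup: a deterministic toy vector."""
    code = sum(ord(ch) for ch in token)
    return [((code + i) % 7) / 7.0 for i in range(dim)]


def interleave(
    tokens: Sequence[str],
    roi_features: Dict[int, Sequence[float]],
) -> List[List[float]]:
    """Build one input sequence, swapping region placeholders for RoI features.

    Ordinary tokens are mapped to language embeddings; each `<regionN>`
    placeholder is replaced in place by the feature vector of region N, so
    the result keeps the original token order.
    """
    sequence: List[List[float]] = []
    for token in tokens:
        match = REGION_PATTERN.match(token)
        if match is None:
            sequence.append(toy_token_embedding(token))
            continue
        region_id = int(match.group(1))
        feature = roi_features[region_id]  # KeyError if the region is unknown
        if len(feature) != EMBED_DIM:
            raise ValueError("RoI feature width must match the embedding width")
        sequence.append(list(feature))
    return sequence


if __name__ == "__main__":
    instruction = "What is <region1> holding ?".split()
    features = {1: [0.9, 0.1, 0.0, 0.5]}  # made-up RoI feature vector
    for token, vector in zip(instruction, interleave(instruction, features)):
        print(f"{token:>10} -> {vector}")
```

The sketch keeps one vector per input position, so the sequence length equals the number of tokens in the instruction. It also requires the RoI feature width to match the language embedding width, which is what allows the two kinds of vectors to share a single sequence.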