Instruction tuning large language models (LLMs) on image-text pairs has achieved unprecedented vision-language multimodal abilities. However, their vision-language alignment is built only at the image level, and the lack of region-level alignment limits their progress toward fine-grained multimodal understanding. In this paper, we propose instruction tuning on regions of interest. The key design is to reformulate the bounding box as a spatial instruction. Interleaved sequences of visual features extracted by the spatial instruction and language embeddings are fed to the LLM, which is trained on region-text data transformed into instruction-tuning format. Our region-level vision-language model, termed GPT4RoI, brings a brand-new conversational and interactive experience beyond image-level understanding. (1) Controllability: Users can interact with our model through both language and spatial instructions to flexibly adjust the level of detail of the question. (2) Capacities: Our model supports not only single-region but also multi-region spatial instructions. This unlocks more region-level multimodal capacities, such as detailed region captioning and complex region reasoning. (3) Composition: Any off-the-shelf object detector can serve as a spatial instruction provider, so that informative object attributes such as color, shape, material, action, and relation to other objects can be mined from our model. The code, data, and demo can be found at https://github.com/jshilong/GPT4RoI.
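To make the key design concrete, the following is a minimal sketch of how a bounding box could act as a spatial instruction whose region feature is interleaved with language embeddings. It is an illustration under stated assumptions, not the released GPT4RoI implementation: the `<region1>` placeholder token, the mean-pooling extractor `pool_region` (a simple stand-in for a region feature extractor such as RoIAlign), and the toy `embed_token` function are all hypothetical names introduced here.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), normalized to [0, 1]


def pool_region(feature_map: Sequence[Sequence[Vector]], box: Box) -> Vector:
    """Mean-pool the feature-map cells covered by a normalized box.

    Assumption: a crude stand-in for a real region feature extractor.
    """
    height, width = len(feature_map), len(feature_map[0])
    x1, y1, x2, y2 = box
    # Convert normalized corners to cell indices, keeping at least one cell.
    c0 = min(int(x1 * width), width - 1)
    r0 = min(int(y1 * height), height - 1)
    c1 = max(c0 + 1, min(math.ceil(x2 * width), width))
    r1 = max(r0 + 1, min(math.ceil(y2 * height), height))
    cells = [feature_map[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    dim = len(cells[0])
    return [sum(cell[d] for cell in cells) / len(cells) for d in range(dim)]


def embed_token(token: str, dim: int) -> Vector:
    """Toy deterministic text embedding (hypothetical; real models learn this table)."""
    code = sum(ord(ch) for ch in token)
    return [((code * (d + 1)) % 97) / 97.0 for d in range(dim)]


def interleave(
    tokens: Sequence[str],
    feature_map: Sequence[Sequence[Vector]],
    regions: Dict[str, Box],
) -> List[Vector]:
    """Build the input sequence: region placeholders become pooled visual
    features, every other token becomes a language embedding."""
    dim = len(feature_map[0][0])
    sequence = []
    for token in tokens:
        if token in regions:
            sequence.append(pool_region(feature_map, regions[token]))
        else:
            sequence.append(embed_token(token, dim))
    return sequence


# 4x4 feature map with 2 channels: channel 0 = row index, channel 1 = column index.
feature_map = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
regions = {"<region1>": (0.5, 0.0, 1.0, 0.5)}  # top-right quadrant of the image
tokens = ["What", "is", "<region1>", "doing", "?"]
sequence = interleave(tokens, feature_map, regions)
print(sequence[2])  # pooled feature that replaces the <region1> placeholder
```

Multi-region instructions fit the same pattern: adding a second entry such as `<region2>` to `regions` and referencing it in `tokens` yields one pooled feature per referenced box, in the order the placeholders appear in the instruction.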