Multimodal large language models (MLLMs) have recently achieved impressive general-purpose vision-language capabilities through visual instruction tuning. However, current MLLMs focus primarily on image-level or box-level understanding and fall short of fine-grained vision-language alignment at the pixel level. In addition, the lack of mask-based instruction data limits their progress. In this paper, we propose Osprey, a mask-text instruction tuning approach that extends MLLMs by incorporating fine-grained mask regions into language instructions, with the aim of pixel-wise visual understanding. To this end, we first meticulously curate a mask-based region-text dataset with 724K samples, and then design a vision-language model that injects pixel-level representations into an LLM. Specifically, Osprey adopts a convolutional CLIP backbone as the vision encoder and employs a mask-aware visual extractor to extract precise visual mask features from high-resolution input. Experimental results demonstrate Osprey's superiority in various region understanding tasks, showcasing its new capability for pixel-level instruction tuning. In particular, Osprey can be integrated seamlessly with the Segment Anything Model (SAM) to obtain multi-granularity semantics. The source code, dataset and demo can be found at https://github.com/CircleRadon/Osprey.
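To make the idea of extracting a region feature from a mask more concrete, the following is a minimal sketch of masked average pooling over a single feature map. It is an illustration under stated assumptions, not Osprey's actual mask-aware visual extractor: the function name `mask_pool`, the nested-list feature layout (height x width x channels), and the toy values are all hypothetical, and the real model operates on convolutional CLIP features rather than hand-written lists.

```python
from typing import List


def mask_pool(features: List[List[List[float]]],
              mask: List[List[int]]) -> List[float]:
    """Average the feature vectors at positions where the mask is non-zero.

    features: H x W x C nested list (hypothetical layout for this sketch).
    mask:     H x W nested list of 0/1 values marking the region.
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, inside in zip(feat_row, mask_row):
            if inside:
                count += 1
                for c in range(channels):
                    total[c] += vec[c]
    if count == 0:
        raise ValueError("mask selects no pixels")
    return [t / count for t in total]


# Toy 2x2 feature map with 2 channels (made-up values).
features = [
    [[1.0, 0.0], [3.0, 2.0]],
    [[5.0, 4.0], [7.0, 6.0]],
]
# Region covering the top-left and bottom-right positions.
mask = [
    [1, 0],
    [0, 1],
]
region_feature = mask_pool(features, mask)
```

In this sketch the pooled vector summarizes only the pixels inside the mask, which is what distinguishes a mask-level feature from a box-level one: a bounding box around the same two pixels would also average in the two background positions.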