Multimodal large language models (MLLMs) have recently achieved impressive general-purpose vision-language capabilities through visual instruction tuning. However, current MLLMs focus primarily on image-level or box-level understanding and fall short of fine-grained vision-language alignment at the pixel level. In addition, the lack of mask-based instruction data limits their advancement. In this paper, we propose Osprey, a mask-text instruction tuning approach that extends MLLMs by incorporating fine-grained mask regions into language instructions, with the aim of achieving pixel-wise visual understanding. To this end, we first curate a mask-based region-text dataset with 724K samples, and then design a vision-language model that injects pixel-level representations into an LLM. Specifically, Osprey adopts a convolutional CLIP backbone as the vision encoder and employs a mask-aware visual extractor to extract precise visual mask features from high-resolution input. Experimental results demonstrate Osprey's superiority in various region understanding tasks, showcasing its new capability for pixel-level instruction tuning. In particular, Osprey can be integrated seamlessly with the Segment Anything Model (SAM) to obtain multi-granularity semantics. The source code, dataset, and demo can be found at https://github.com/CircleRadon/Osprey.
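The abstract does not spell out how the mask-aware visual extractor computes its features. One common way to turn a binary mask into a region feature is mask-average pooling over a feature map, and the sketch below illustrates that idea only. The function name `mask_average_pool`, the nested-list layout, and the toy values are all hypothetical and are not taken from the Osprey code.

```python
from typing import List, Sequence


def mask_average_pool(
    features: Sequence[Sequence[Sequence[float]]],
    mask: Sequence[Sequence[int]],
) -> List[float]:
    """Average each feature channel over the pixels selected by a binary mask.

    features: nested sequence of shape (C, H, W), an assumed feature-map layout.
    mask:     nested sequence of shape (H, W); truthy entries mark region pixels.
    Returns one pooled value per channel (length C).
    """
    # Number of pixels inside the region.
    area = sum(1 for row in mask for m in row if m)

    pooled = []
    for channel in features:
        # Every channel must match the spatial size of the mask.
        if len(channel) != len(mask) or any(
            len(f_row) != len(m_row) for f_row, m_row in zip(channel, mask)
        ):
            raise ValueError("feature map and mask must share the same H x W")
        if area == 0:
            # Empty mask: return zeros rather than dividing by zero.
            pooled.append(0.0)
            continue
        total = sum(
            value
            for f_row, m_row in zip(channel, mask)
            for value, m in zip(f_row, m_row)
            if m
        )
        pooled.append(total / area)
    return pooled


# Toy example: 2 channels on a 2x2 grid, mask selects the diagonal pixels.
features = [
    [[0, 1], [2, 3]],
    [[4, 5], [6, 7]],
]
mask = [
    [1, 0],
    [0, 1],
]
region_feature = mask_average_pool(features, mask)
print(region_feature)  # [1.5, 5.5]
```

In a setting like the one the abstract describes, a pooled vector of this kind would stand in for the masked region when it is passed to the language model. How Osprey actually builds and projects its mask features is documented in the linked repository.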