Multimodal large language models (MLLMs) have recently achieved impressive general-purpose vision-language capabilities through visual instruction tuning. However, current MLLMs focus primarily on image-level or box-level understanding and fall short of fine-grained vision-language alignment at the pixel level. In addition, the lack of mask-based instruction data limits their progress. In this paper, we propose Osprey, a mask-text instruction tuning approach that extends MLLMs by incorporating fine-grained mask regions into language instructions, with the aim of achieving pixel-wise visual understanding. To this end, we first curate a mask-based region-text dataset with 724K samples, and then design a vision-language model that injects pixel-level representations into an LLM. Specifically, Osprey adopts a convolutional CLIP backbone as the vision encoder and employs a mask-aware visual extractor to extract precise visual mask features from high-resolution input. Experimental results demonstrate Osprey's superiority in various region understanding tasks, showcasing its new capability for pixel-level instruction tuning. In particular, Osprey can be integrated seamlessly with the Segment Anything Model (SAM) to obtain multi-granularity semantics. The source code, dataset, and demo can be found at https://github.com/CircleRadon/Osprey.
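The abstract does not say how the mask-aware visual extractor computes a region feature. One common way to turn a binary mask and an image feature map into a single region embedding is mask-average pooling, sketched below on a toy example. This is an illustration under that assumption, not Osprey's implementation: the names `mask_pool`, `FeatureMap`, and `Mask` are hypothetical, and the feature values are made up.

```python
from typing import List

# Hypothetical type aliases for this sketch (not from the Osprey codebase).
FeatureMap = List[List[List[float]]]  # [channels][height][width]
Mask = List[List[int]]                # [height][width], 1 inside the region


def mask_pool(features: FeatureMap, mask: Mask) -> List[float]:
    """Average each channel over the pixels where the mask is 1."""
    area = sum(sum(row) for row in mask)
    if area == 0:
        raise ValueError("mask selects no pixels")
    pooled = []
    for channel in features:
        total = 0.0
        for feat_row, mask_row in zip(channel, mask):
            for value, inside in zip(feat_row, mask_row):
                if inside:
                    total += value
        pooled.append(total / area)
    return pooled


# Toy 2-channel, 2x3 feature map; the mask covers the left two columns.
features = [
    [[1.0, 2.0, 9.0],
     [3.0, 4.0, 9.0]],
    [[0.0, 0.0, 5.0],
     [2.0, 2.0, 5.0]],
]
mask = [
    [1, 1, 0],
    [1, 1, 0],
]

region_embedding = mask_pool(features, mask)
print(region_embedding)  # [2.5, 1.0]
```

The result is one vector per region, with one entry per feature channel. Pixels outside the mask (the rightmost column here) do not contribute, which is what distinguishes a mask-level feature from a box-level one.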