We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of understanding spatial referring of any shape or granularity within an image and accurately grounding open-vocabulary descriptions. To unify referring and grounding in the LLM paradigm, Ferret employs a novel hybrid region representation that combines discrete coordinates with continuous features to represent a region in the image. To extract the continuous features of regions with arbitrary shapes, we propose a spatial-aware visual sampler that handles the varying sparsity of different shapes. Consequently, Ferret can accept diverse region inputs, such as points, bounding boxes, and free-form shapes. To strengthen these capabilities, we curate GRIT, a comprehensive refer-and-ground instruction tuning dataset of 1.1M samples that contain rich hierarchical spatial knowledge, including 95K hard negative samples to promote model robustness. The resulting model not only achieves superior performance on classical referring and grounding tasks, but also greatly outperforms existing MLLMs in region-based and localization-demanding multimodal chatting. Our evaluations also reveal a significantly improved ability to describe image details and a marked reduction in object hallucination. Code and data will be available at https://github.com/apple/ml-ferret.
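To make the idea of a hybrid region representation concrete, the following is a minimal sketch, not Ferret's implementation. It pairs discrete coordinates (a quantized bounding box) with a continuous feature for a region given as a binary mask. All names here (`discrete_box`, `pooled_feature`, `hybrid_region`, `num_bins`) are hypothetical, and plain mask-average pooling is swapped in as a simple stand-in for the spatial-aware visual sampler, whose actual design is not described in this abstract.

```python
from typing import Dict, List, Tuple

Mask = List[List[int]]                 # H x W binary mask, 1 = pixel is in the region
FeatureMap = List[List[List[float]]]   # H x W x C per-pixel image features


def discrete_box(mask: Mask, num_bins: int = 1000) -> Tuple[int, int, int, int]:
    """Quantize the region's bounding box into integer bins (x1, y1, x2, y2).

    The bin count is an assumption made for illustration.
    """
    h, w = len(mask), len(mask[0])
    xs = [x for y in range(h) for x in range(w) if mask[y][x]]
    ys = [y for y in range(h) for x in range(w) if mask[y][x]]
    if not xs:
        raise ValueError("region mask is empty")

    def quantize(value: int, size: int) -> int:
        # Integer arithmetic avoids floating-point rounding surprises.
        return min(num_bins - 1, value * num_bins // size)

    return (
        quantize(min(xs), w),
        quantize(min(ys), h),
        quantize(max(xs) + 1, w),
        quantize(max(ys) + 1, h),
    )


def pooled_feature(mask: Mask, feats: FeatureMap) -> List[float]:
    """Average the features of all pixels inside the mask.

    Stand-in for a learned sampler: it works for points, boxes, and
    free-form shapes alike because it only looks at masked pixels.
    """
    channels = len(feats[0][0])
    total = [0.0] * channels
    count = 0
    for y, row in enumerate(mask):
        for x, inside in enumerate(row):
            if inside:
                count += 1
                for k in range(channels):
                    total[k] += feats[y][x][k]
    if count == 0:
        raise ValueError("region mask is empty")
    return [t / count for t in total]


def hybrid_region(mask: Mask, feats: FeatureMap, num_bins: int = 1000) -> Dict[str, object]:
    """Bundle discrete coordinates and a continuous feature for one region."""
    return {
        "coords": discrete_box(mask, num_bins),
        "feature": pooled_feature(mask, feats),
    }
```

In this sketch a point, a box, and a free-form shape are all expressed as masks, so the same function covers every input type; a single-pixel mask plays the role of a point. In a full model the coordinates would presumably be rendered as text tokens and the feature vector projected into the LLM's embedding space, which is beyond what this example covers.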