We present Set-of-Mark (SoM), a new visual prompting method that unleashes the visual grounding abilities of large multimodal models (LMMs) such as GPT-4V. As illustrated in Fig. 1 (right), we employ off-the-shelf interactive segmentation models, such as SEEM/SAM, to partition an image into regions at different levels of granularity, and overlay these regions with a set of marks, e.g., alphanumerics, masks, boxes. Using the marked image as input, GPT-4V can answer questions that require visual grounding. We perform a comprehensive empirical study to validate the effectiveness of SoM on a wide range of fine-grained vision and multimodal tasks. For example, our experiments show that GPT-4V with SoM in a zero-shot setting outperforms the state-of-the-art fully finetuned referring expression comprehension and segmentation model on RefCOCOg. Code for SoM prompting is publicly available at: https://github.com/microsoft/SoM.
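The overlay step described in the abstract can be illustrated with a minimal sketch. It assumes the segmentation model has already produced one boolean mask per region; the function names `mark_positions` and `overlay_masks` are hypothetical, and placing each mark at the mask pixel nearest the region centroid is an illustrative simplification, not necessarily the mark allocation used in the SoM code.

```python
import numpy as np


def mark_positions(masks):
    """Pick one pixel per region at which to draw its numeric mark.

    masks: list of boolean arrays of shape (H, W), one per region
           (assumed to come from a segmentation model such as SEEM/SAM).
    Returns {mark_id: (row, col)} with mark ids starting at 1.
    """
    positions = {}
    for mark_id, mask in enumerate(masks, start=1):
        rows, cols = np.nonzero(mask)
        if rows.size == 0:
            continue  # an empty region gets no mark
        cy, cx = rows.mean(), cols.mean()
        # The centroid can fall outside a non-convex region, so snap it
        # to the closest pixel that actually belongs to the mask.
        nearest = np.argmin((rows - cy) ** 2 + (cols - cx) ** 2)
        positions[mark_id] = (int(rows[nearest]), int(cols[nearest]))
    return positions


def overlay_masks(image, masks, alpha=0.5, seed=0):
    """Alpha-blend a random color over each region of an (H, W, 3) uint8 image."""
    rng = np.random.default_rng(seed)
    out = image.astype(np.float32)  # astype returns a copy, input is untouched
    for mask in masks:
        color = rng.integers(0, 256, size=3).astype(np.float32)
        out[mask] = (1 - alpha) * out[mask] + alpha * color
    return out.astype(np.uint8)
```

Drawing the actual glyph (for example the digit for each mark id) at the returned positions is left to an imaging library. The marked image would then be sent to the LMM together with a question that refers to regions by their mark ids.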