The Segment Anything model (SAM) has shown a generalized ability to group image pixels into patches, but applying it to semantic-aware segmentation still faces major challenges. This paper presents SAM-CP, a simple approach that establishes two types of composable prompts beyond SAM and composes them for versatile segmentation. Specifically, given a set of classes (in text) and a set of SAM patches, the Type-I prompt judges whether a SAM patch aligns with a text label, and the Type-II prompt judges whether two SAM patches with the same text label also belong to the same instance. To reduce the complexity of handling a large number of semantic classes and patches, we establish a unified framework that computes the affinity between (semantic and instance) queries and SAM patches and merges the patches with high affinity to each query. Experiments show that SAM-CP achieves semantic, instance, and panoptic segmentation in both open and closed domains; in particular, it attains state-of-the-art performance in open-vocabulary segmentation. Our research offers a novel and generalized methodology for equipping vision foundation models such as SAM with multi-grained semantic perception abilities.
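To make the affinity-based merging described above more concrete, here is a minimal PyTorch sketch. It is an illustration under stated assumptions, not the paper's implementation: the function name `merge_patches_by_affinity`, the sigmoid-of-scaled-dot-product affinity, and the 0.5 threshold are all hypothetical choices standing in for whatever affinity head SAM-CP actually learns.

```python
import torch

def merge_patches_by_affinity(query_emb, patch_embs, patch_masks, threshold=0.5):
    """Merge SAM patches whose affinity to one (semantic or instance)
    query exceeds a threshold. Shapes are illustrative:
      query_emb:   (d,)       embedding of a single query
      patch_embs:  (n, d)     embeddings of n SAM patches
      patch_masks: (n, H, W)  binary masks of the n patches
    Returns an (H, W) mask covering the matched patches.
    """
    # Hypothetical affinity: sigmoid of a scaled dot product between the
    # query and each patch embedding (the paper's actual head may differ).
    affinity = torch.sigmoid(patch_embs @ query_emb / query_emb.shape[0] ** 0.5)
    keep = affinity > threshold            # patches judged to align with the query
    merged = patch_masks[keep].any(dim=0)  # union of the selected patch masks
    return merged

# Toy usage: 4 random patches on a 32x32 grid, matched against one query.
n, d, H, W = 4, 16, 32, 32
merged = merge_patches_by_affinity(
    torch.randn(d), torch.randn(n, d), torch.rand(n, H, W) > 0.5
)
print(merged.shape)  # torch.Size([32, 32])
```

In this reading, a semantic query would yield the union of all patches of one class (semantic segmentation), while an instance query would yield the union of patches belonging to one object, which is how the Type-I and Type-II judgments compose into panoptic output.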