The Segment Anything Model (SAM) has established itself as a powerful zero-shot image segmentation model, employing interactive prompts such as points to generate masks. This paper presents SAM-PT, a method extending SAM's capability to tracking and segmenting anything in dynamic videos. SAM-PT leverages robust and sparse point selection and propagation techniques for mask generation, demonstrating that a SAM-based segmentation tracker can yield strong zero-shot performance across popular video object segmentation benchmarks, including DAVIS, YouTube-VOS, and MOSE. Compared to traditional object-centric mask propagation strategies, we uniquely use point propagation to exploit local structure information that is agnostic to object semantics. We highlight the merits of point-based tracking through direct evaluation on the zero-shot open-world Unidentified Video Objects (UVO) benchmark. To further enhance our approach, we utilize K-Medoids clustering for point initialization and track both positive and negative points to clearly distinguish the target object. We also employ multiple mask decoding passes for mask refinement and devise a point re-initialization strategy to improve tracking accuracy. Our code integrates different point trackers and video segmentation benchmarks and will be released at https://github.com/SysCV/sam-pt.
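The K-Medoids point initialization mentioned in the abstract can be illustrated with a minimal sketch. This is not the released SAM-PT code: the function name `kmedoids_points`, its parameters, the candidate subsampling, and the simple alternating update are all assumptions made for illustration. The idea shown is only that query points are chosen as medoids of the foreground pixel coordinates, so every selected point lies on the object and the points spread across it.

```python
import numpy as np


def kmedoids_points(mask, k, n_iter=20, max_candidates=2000, seed=0):
    """Pick k query points inside a binary mask with a simple k-medoids loop.

    Hypothetical helper for illustration; returns an array of shape (k, 2)
    holding (row, col) pixel coordinates that all lie on the mask.
    """
    rng = np.random.default_rng(seed)
    coords = np.argwhere(mask)  # (N, 2) coordinates of foreground pixels
    if len(coords) < k:
        raise ValueError("mask has fewer foreground pixels than k")
    # Subsample candidates so the pairwise distance matrix stays small.
    if len(coords) > max_candidates:
        keep = rng.choice(len(coords), size=max_candidates, replace=False)
        coords = coords[keep]
    pts = coords.astype(float)
    # Pairwise Euclidean distances between all candidate pixels.
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)

    # Start from k distinct random candidates, then alternate between
    # assigning pixels to the nearest medoid and re-picking each medoid.
    medoids = rng.choice(len(pts), size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue
            # The medoid is the member with the smallest total distance
            # to the other members of its cluster.
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return coords[medoids]


# Toy example: a rectangular "object" in a 64x64 frame.
mask = np.zeros((64, 64), dtype=bool)
mask[10:30, 20:50] = True
points = kmedoids_points(mask, k=4)
```

Under the same assumptions, negative points could be drawn by applying the function to the inverted mask (`~mask`), giving background points to track alongside the positive ones.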