Promptable segmentation models, such as the Segment Anything Model (SAM), have recently demonstrated robust zero-shot generalization on static images. These models can also denoise imprecise prompts, such as loosely fitted bounding boxes. In this paper, we explore the potential of applying SAM to track and segment objects in videos by framing tracking as a prompt denoising task. Specifically, we iteratively propagate the bounding box of each object's mask in the preceding frame as the prompt for the next frame. Furthermore, to enhance SAM's denoising capability against position and size variations, we propose a multi-prompt strategy in which we provide multiple jittered and scaled box prompts for each object and keep the mask prediction with the highest semantic similarity to the template mask. We also introduce a point-based refinement stage to handle occlusions and reduce cumulative errors. Without any dedicated tracking module, our approach achieves performance comparable to existing methods on video object/instance segmentation tasks across three datasets: DAVIS2017, YouTubeVOS2018, and UVO. It thus serves as a concise baseline and endows SAM-based downstream applications with tracking capabilities.
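The box propagation and multi-prompt selection described above can be illustrated with a minimal sketch. This is not the paper's implementation: `predict_mask` and `similarity` are hypothetical callables standing in for a box-prompted SAM prediction and the semantic similarity measure, the shift and scale values are illustrative assumptions, and the point-based refinement stage is omitted.

```python
import numpy as np


def mask_to_box(mask):
    """Return the tight (x0, y0, x1, y1) box around a boolean mask, or None if empty."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return (int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1)


def make_box_prompts(box, shifts=(-0.1, 0.0, 0.1), scales=(0.9, 1.0, 1.1)):
    """Generate shifted and scaled copies of a box.

    Shifts are fractions of the box width/height; the values are assumptions
    chosen for illustration, not taken from the paper.
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    w, h = x1 - x0, y1 - y0
    prompts = []
    for s in scales:
        for dx in shifts:
            for dy in shifts:
                ncx, ncy = cx + dx * w, cy + dy * h
                nw, nh = s * w, s * h
                prompts.append((ncx - nw / 2, ncy - nh / 2, ncx + nw / 2, ncy + nh / 2))
    return prompts


def track_step(frame, prev_mask, template, predict_mask, similarity):
    """Propagate one object by one frame.

    predict_mask(frame, box) -> boolean mask   (hypothetical, e.g. a SAM wrapper)
    similarity(mask, template) -> float        (hypothetical scoring function)
    """
    box = mask_to_box(prev_mask)
    if box is None:
        # Nothing to propagate; a real system would need a recovery strategy here.
        return prev_mask
    candidates = [predict_mask(frame, p) for p in make_box_prompts(box)]
    scores = [similarity(m, template) for m in candidates]
    # Keep the candidate that best matches the template.
    return candidates[int(np.argmax(scores))]
```

Running `track_step` once per frame, with the returned mask fed back in as `prev_mask`, gives the iterative propagation loop. Passing the predictor and the scoring function as arguments keeps the sketch independent of any particular SAM interface.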