The performance of image segmentation models has historically been constrained by the high cost of collecting large-scale annotated data. The Segment Anything Model (SAM) alleviates this problem through a promptable, semantics-agnostic segmentation paradigm, yet it still requires manual visual prompts or complex, domain-dependent prompt-generation rules to process a new image. To reduce this new burden, our work investigates object segmentation when only a small set of reference images is provided. Our key insight is to leverage the strong semantic priors learned by foundation models to identify corresponding regions between a reference and a target image. We find that such correspondences enable the automatic generation of instance-level segmentation masks for downstream tasks, and we instantiate this idea in a multi-stage, training-free method comprising (1) memory bank construction, (2) representation aggregation, and (3) semantic-aware feature matching. Our experiments show significant improvements on segmentation metrics, leading to state-of-the-art performance on COCO FSOD (36.8% nAP) and PASCAL VOC Few-Shot (71.2% nAP50), and outperforming existing training-free approaches on the Cross-Domain FSOD benchmark (22.4% nAP).
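The multi-stage pipeline named above can be sketched in miniature. This is a hedged illustration, not the paper's implementation: it assumes the memory bank aggregates L2-normalized reference features into per-class prototypes and that matching uses cosine similarity between prototypes and target patch features; the function names and data shapes are hypothetical.

```python
import numpy as np

def build_memory_bank(ref_features):
    """Stage (1)+(2): aggregate reference features per class into a prototype.

    ref_features: dict mapping class name -> array of shape (n_refs, d).
    Assumes simple L2-normalized averaging as the aggregation rule.
    """
    bank = {}
    for cls, feats in ref_features.items():
        feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
        proto = feats.mean(axis=0)
        bank[cls] = proto / np.linalg.norm(proto)
    return bank

def match_features(bank, target_feats):
    """Stage (3): semantic-aware matching via cosine similarity.

    target_feats: array of shape (n_patches, d) of target-image features.
    Returns, per class, the index of the best-matching target patch and its
    similarity; such a location could then serve as an automatic prompt.
    """
    t = target_feats / np.linalg.norm(target_feats, axis=1, keepdims=True)
    matches = {}
    for cls, proto in bank.items():
        sims = t @ proto  # cosine similarity of every patch to the prototype
        idx = int(np.argmax(sims))
        matches[cls] = (idx, float(sims[idx]))
    return matches
```

In a full system, the best-matching region would be converted into a point or box prompt for SAM; here the sketch stops at the correspondence step, which is the part the abstract describes.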