The recent Segment Anything Model (SAM) has emerged as a new paradigmatic vision foundation model, showcasing potent zero-shot generalization and flexible prompting. Although SAM has found applications and adaptations in various domains, its primary limitation is its inability to grasp object semantics. In this paper, we present Sambor, which seamlessly integrates SAM with an open-vocabulary object detector in an end-to-end framework. While retaining all the capabilities inherent to SAM, we extend it to detect arbitrary objects based on human inputs such as category names or referring expressions. To accomplish this, we introduce a novel SideFormer module that extracts SAM features to facilitate zero-shot object localization and injects comprehensive semantic information for open-vocabulary recognition. In addition, we devise an open-set region proposal network (Open-set RPN) that enables the detector to acquire the open-set proposals generated by SAM. Sambor demonstrates superior zero-shot performance across benchmarks, including COCO and LVIS, proving highly competitive with previous state-of-the-art (SoTA) methods. We hope this work serves as a meaningful step toward endowing SAM with the ability to recognize diverse object categories and toward advancing open-vocabulary learning with the support of vision foundation models.