Understanding the semantics of individual regions or patches within unconstrained images, as required in open-world object detection, is a critical yet challenging task in computer vision. Building on the success of powerful image-level vision-language (ViL) foundation models like CLIP, recent efforts have sought to harness their capabilities either by training a contrastive model from scratch on an extensive collection of region-label pairs or by aligning the outputs of a detection model with image-level representations of region proposals. Despite notable progress, these approaches suffer from computationally intensive training, susceptibility to data noise, and a lack of contextual information. To address these limitations, we explore the synergistic potential of off-the-shelf foundation models, leveraging their respective strengths in localization and semantics. We introduce a novel, generic, and efficient region recognition architecture, named RegionSpot, designed to integrate position-aware localization knowledge from a localization foundation model (e.g., SAM) with semantic information extracted from a ViL model (e.g., CLIP). To fully exploit pretrained knowledge while minimizing training overhead, we keep both foundation models frozen and optimize only a lightweight attention-based knowledge integration module. In extensive experiments on open-world object recognition, RegionSpot achieves significant performance improvements over prior alternatives while also providing substantial computational savings. For instance, our model can be trained on 3 million data samples in a single day using 8 V100 GPUs. It outperforms GLIP by 6.5% in mean average precision (mAP), with an even larger margin of 14.8% on the more challenging rare categories.
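To make the idea of an attention-based knowledge integration module concrete, the following is a minimal sketch, not the RegionSpot implementation. It assumes that region tokens from the frozen localization model act as queries and that patch features from the frozen ViL model act as keys and values; the abstract does not state this assignment. All names, dimensions, and the random stand-in features are hypothetical, and neither SAM nor CLIP is actually called.

```python
import math
import random


def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p), both as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(region_tokens, image_feats, w_q, w_k, w_v):
    """Single-head cross-attention (assumed design, see lead-in).

    region_tokens: R x d_loc, stand-in for frozen localization-model outputs.
    image_feats:   N x d_sem, stand-in for frozen ViL patch features.
    Returns the fused R x d region features and the R x N attention weights.
    """
    q = matmul(region_tokens, w_q)   # R x d
    k = matmul(image_feats, w_k)     # N x d
    v = matmul(image_feats, w_v)     # N x d
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))  # R x N
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
region_tokens = random_matrix(2, 4, rng)  # 2 regions, hypothetical d_loc = 4
image_feats = random_matrix(6, 8, rng)    # 6 patches, hypothetical d_sem = 8

# Only these projections would be trained; both feature extractors stay frozen.
w_q = random_matrix(4, 5, rng)
w_k = random_matrix(8, 5, rng)
w_v = random_matrix(8, 5, rng)

fused, weights = cross_attention(region_tokens, image_feats, w_q, w_k, w_v)
```

In this sketch each region yields one fused feature vector that mixes semantic patch features according to its attention weights. A full system would then compare such vectors against text embeddings of category names, which is omitted here.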