Recognize Any Regions

Understanding the semantics of individual regions or patches within unconstrained images, such as in open-world object detection, represents a critical yet challenging task in computer vision. Building on the success of powerful image-level vision-language (ViL) foundation models like CLIP, recent efforts have sought to harness their capabilities by either training a contrastive model from scratch with an extensive collection of region-label pairs or aligning the outputs of a detection model with image-level representations of region proposals. Despite notable progress, these approaches suffer from computationally intensive training, susceptibility to data noise, and a lack of contextual information. To address these limitations, we explore the synergistic potential of off-the-shelf foundation models, leveraging their respective strengths in localization and semantics. We introduce a novel, generic, and efficient region recognition architecture, named RegionSpot, designed to integrate position-aware localization knowledge from a localization foundation model (e.g., SAM) with semantic information extracted from a ViL model (e.g., CLIP). To fully exploit pretrained knowledge while minimizing training overhead, we keep both foundation models frozen and optimize only a lightweight attention-based knowledge integration module. In extensive experiments on open-world object recognition, RegionSpot demonstrates significant performance improvements over prior alternatives while also providing substantial computational savings. For instance, our model can be trained on 3 million data samples in a single day using 8 V100 GPUs. It outperforms GLIP by 6.5% in mean average precision (mAP), with an even larger margin of 14.8% on more challenging and rare categories.
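The idea of an attention-based integration module sitting between two frozen models can be illustrated with a minimal sketch. This is not the authors' implementation: random arrays stand in for the outputs of the frozen localization model (region tokens) and the frozen ViL model (image feature map and text embeddings), and the class `CrossAttentionFusion`, the function `classify_regions`, and all dimensions are hypothetical names and values chosen for illustration. The sketch assumes single-head cross-attention in which region tokens act as queries over the ViL feature map, followed by cosine-similarity matching against text embeddings.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


class CrossAttentionFusion:
    """Hypothetical single-head cross-attention module.

    In this sketch it is the only component that would be trained;
    the two foundation models are assumed frozen and are represented
    by precomputed arrays.
    """

    def __init__(self, d_region, d_vil, d_model, rng):
        scale = 0.02
        self.w_q = rng.normal(0.0, scale, (d_region, d_model))
        self.w_k = rng.normal(0.0, scale, (d_vil, d_model))
        self.w_v = rng.normal(0.0, scale, (d_vil, d_model))
        self.w_out = rng.normal(0.0, scale, (d_model, d_vil))

    def __call__(self, region_tokens, vil_features):
        q = region_tokens @ self.w_q           # (R, d_model) queries from region tokens
        k = vil_features @ self.w_k            # (N, d_model) keys from ViL feature map
        v = vil_features @ self.w_v            # (N, d_model) values from ViL feature map
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (R, N)
        fused = attn @ v                       # (R, d_model)
        return fused @ self.w_out, attn        # project back to the ViL embedding size


def classify_regions(region_embeddings, text_embeddings):
    # Cosine similarity between each region embedding and each category text embedding.
    sims = l2_normalize(region_embeddings) @ l2_normalize(text_embeddings).T
    return sims.argmax(axis=-1), sims


rng = np.random.default_rng(0)
R, N, C = 3, 16, 5                  # regions, feature-map positions, categories (assumed)
d_region, d_vil, d_model = 8, 12, 10

# Random stand-ins for frozen-model outputs (assumption, not real SAM/CLIP features).
region_tokens = rng.normal(size=(R, d_region))
vil_features = rng.normal(size=(N, d_vil))
text_embeddings = rng.normal(size=(C, d_vil))

fusion = CrossAttentionFusion(d_region, d_vil, d_model, rng)
region_embeddings, attn = fusion(region_tokens, vil_features)
labels, sims = classify_regions(region_embeddings, text_embeddings)
```

Because only the four projection matrices would receive gradients in a setup like this, the trainable parameter count stays small relative to the frozen backbones, which is the kind of design the abstract credits for its low training cost.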

