Although weakly supervised object detection (WSOD) is a promising step toward avoiding strong instance-level annotations, its capability is confined to closed-set categories within a single training dataset. In this paper, we propose a novel weakly supervised open-vocabulary object detection framework, namely WSOVOD, which extends traditional WSOD to detect novel concepts and to utilize diverse datasets with only image-level annotations. To achieve this, we explore three vital strategies: dataset-level feature adaptation, image-level salient object localization, and region-level vision-language alignment. First, we perform data-aware feature extraction to produce an input-conditional coefficient, which is incorporated into dataset attribute prototypes to identify dataset bias and help achieve cross-dataset generalization. Second, we propose a customized location-oriented weakly supervised region proposal network that utilizes high-level semantic layouts from the category-agnostic Segment Anything Model (SAM) to distinguish object boundaries. Lastly, we introduce a proposal-concept synchronized multiple-instance network, i.e., object mining and refinement with visual-semantic alignment, to discover objects matched to the text embeddings of concepts. Extensive experiments on Pascal VOC and MS COCO demonstrate that the proposed WSOVOD achieves new state-of-the-art results compared with previous WSOD methods in both closed-set object localization and detection tasks. Meanwhile, WSOVOD enables cross-dataset and open-vocabulary learning, achieving performance on par with or even better than well-established fully supervised open-vocabulary object detection (FSOVOD).
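To make the region-level vision-language alignment idea more concrete, the following is a minimal sketch of how proposal features might be scored against concept text embeddings and aggregated into image-level predictions that can be supervised with image-level labels only. It is not the WSOVOD implementation: the two-stream softmax aggregation is a generic multiple-instance formulation assumed here for illustration, and the function names, the `temperature` parameter, and the random stand-in features are all hypothetical.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit L2 norm."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def mil_image_scores(region_feats, text_embeds, temperature=0.07):
    """Score R proposals against C concepts and pool to image level.

    region_feats: (R, D) proposal features (assumed already projected
                  into the same space as the text embeddings).
    text_embeds:  (C, D) concept text embeddings.
    Returns (proposal_scores of shape (R, C), image_scores of shape (C,)).
    """
    # Cosine similarity between every proposal and every concept.
    sim = l2_normalize(region_feats) @ l2_normalize(text_embeds).T / temperature
    # Classification stream: which concept does each proposal look like?
    cls = softmax(sim, axis=1)
    # Detection stream: which proposals matter most for each concept?
    det = softmax(sim, axis=0)
    proposal_scores = cls * det
    # Summing over proposals gives one score in [0, 1] per concept.
    image_scores = np.clip(proposal_scores.sum(axis=0), 0.0, 1.0)
    return proposal_scores, image_scores


def image_level_bce(image_scores, labels, eps=1e-6):
    """Binary cross-entropy against image-level (multi-hot) labels."""
    p = np.clip(image_scores, eps, 1.0 - eps)
    return float(-(labels * np.log(p) + (1 - labels) * np.log(1 - p)).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    region_feats = rng.normal(size=(5, 16))   # stand-in for proposal features
    text_embeds = rng.normal(size=(3, 16))    # stand-in for text-encoder output
    labels = np.array([1.0, 0.0, 1.0])        # image-level labels only
    proposal_scores, image_scores = mil_image_scores(region_feats, text_embeds)
    print(image_scores, image_level_bce(image_scores, labels))
```

Because the concept classifier in this sketch is just a matrix of text embeddings, a novel concept can be scored at test time by appending a row to `text_embeds`, with no new learned weights. This is the general property that makes text-embedding classifiers suitable for open-vocabulary detection.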