Weakly supervised object detection (WSOD) aims to learn precise object detectors from image-level tags alone. Despite intensive research on deep learning (DL) approaches over the past few years, a significant performance gap remains between WSOD and fully supervised object detection. Most existing WSOD methods consider only the visual appearance of each region proposal and ignore the useful context information in the image. To this end, this paper proposes an interactive end-to-end WSOD framework called JLWSOD with two innovations: i) two types of WSOD-specific context information (i.e., instance-wise correlation and semantic-wise correlation) are proposed and introduced into the WSOD framework; ii) an interactive graph contrastive learning (iGCL) mechanism is designed to jointly optimize the visual appearance and context information for better WSOD performance. Specifically, the iGCL mechanism takes full advantage of the two complementary interpretations of WSOD, namely the instance-wise detection and semantic-wise prediction tasks, forming a more comprehensive solution. Extensive experiments on the widely used PASCAL VOC and MS COCO benchmarks verify the superiority of JLWSOD over alternative state-of-the-art approaches and baseline models (improvements of 3.6%~23.3% on mAP and 3.4%~19.7% on CorLoc).