Although depth has recently been explored for a variety of tasks, it remains an unexplored modality for weakly-supervised object detection (WSOD). We propose an amplifier method that improves WSOD performance by integrating depth information. The approach can be applied to any WSOD method based on multiple-instance learning, without requiring additional annotations or incurring large computational cost. It uses a monocular depth estimation technique to obtain hallucinated depth information, which is then incorporated into a Siamese WSOD network through a contrastive loss and fusion. By analyzing the relationship between language context and depth, we compute depth priors that identify the bounding box proposals that may contain an object of interest. These depth priors are then used either to update the list of pseudo ground-truth boxes or to adjust the confidence of per-box predictions. We evaluate the method on six datasets (COCO, PASCAL VOC, Conceptual Captions, Clipart1k, Watercolor2k, and Comic2k) by implementing it on top of two state-of-the-art WSOD methods, and we demonstrate a substantial improvement in performance.
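As an illustration of how a depth prior might adjust the confidence of per-box predictions, consider the minimal sketch below. It is not the method evaluated here: the abstract does not specify the form of the prior, so the sketch assumes a prior expressed as a plausible depth range for a class and a fixed multiplicative penalty for boxes outside that range. The names `Proposal`, `mean_box_depth`, `rescore_with_depth_prior`, `depth_range`, and `penalty` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels, x2 and y2 exclusive


@dataclass
class Proposal:
    box: Box
    score: float  # detector confidence for a single class


def mean_box_depth(depth_map: List[List[float]], box: Box) -> float:
    """Average the estimated depth over the pixels covered by a box."""
    x1, y1, x2, y2 = box
    values = [depth_map[y][x] for y in range(y1, y2) for x in range(x1, x2)]
    if not values:
        raise ValueError("box covers no pixels")
    return sum(values) / len(values)


def rescore_with_depth_prior(
    proposals: List[Proposal],
    depth_map: List[List[float]],
    depth_range: Tuple[float, float],
    penalty: float = 0.5,
) -> List[Proposal]:
    """Down-weight proposals whose mean depth falls outside the prior range.

    ASSUMPTION: the prior is a (low, high) depth interval and the adjustment
    is a fixed multiplicative penalty; both are illustrative choices only.
    """
    low, high = depth_range
    rescored = []
    for p in proposals:
        depth = mean_box_depth(depth_map, p.box)
        factor = 1.0 if low <= depth <= high else penalty
        rescored.append(Proposal(p.box, p.score * factor))
    return rescored
```

With a depth map whose left half is near (0.2) and right half is far (0.8), and a prior range of (0.1, 0.4), a proposal on the left keeps its score while one on the right is halved. The same per-box depth test could in principle be used to filter candidates for the pseudo ground-truth list instead of rescaling scores.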