A Visual Representation-guided Framework with Global Affinity for Weakly Supervised Salient Object Detection

Fully supervised salient object detection (SOD) methods have made considerable progress, yet they rely heavily on expensive pixel-wise labels. Recently, scribble-based SOD methods have attracted increasing attention as a trade-off between labeling burden and performance. Previous scribble-based models learn the SOD task directly from SOD training data alone, which carries limited information, so it is extremely difficult for them to understand the image and, in turn, to achieve strong SOD performance. In this paper, we propose a simple yet effective framework for scribble-based SOD that is guided by general visual representations carrying rich contextual semantic knowledge. These general visual representations are generated by self-supervised learning on large-scale unlabeled datasets. Our framework consists of a task-related encoder, a general visual module, and an information integration module; it efficiently combines the general visual representations with task-related features so that SOD is performed with an understanding of the contextual connections within images. Meanwhile, we propose a novel global semantic affinity loss to guide the model to perceive the global structure of salient objects. Experimental results on five public benchmark datasets demonstrate that our method, which uses only scribble annotations without introducing any extra labels, outperforms the state-of-the-art weakly supervised SOD methods. Specifically, it outperforms the previous best scribble-based method on all datasets, with average gains of 5.5% in max F-measure, 5.8% in mean F-measure, and 3.1% in E-measure, and a 24% reduction in MAE. Moreover, our method achieves performance comparable or even superior to that of state-of-the-art fully supervised models.
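
For readers unfamiliar with the reported metrics, the following is a minimal sketch of how MAE and the max and mean F-measure are commonly computed for a single predicted saliency map. It is not the authors' evaluation code. The function names `mae` and `f_measure_curve`, the weighting `beta2 = 0.3`, and the 255 evenly spaced thresholds are assumptions based on common SOD practice; the abstract does not state them. E-measure is omitted.

```python
import numpy as np


def mae(pred, gt):
    """Mean absolute error between a saliency map in [0, 1] and a binary mask."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    return float(np.mean(np.abs(pred - gt)))


def f_measure_curve(pred, gt, beta2=0.3, num_thresholds=255):
    """F-measure at evenly spaced binarization thresholds in [0, 1).

    beta2 = 0.3 is an assumed weighting (common in SOD evaluation),
    not a value taken from the abstract.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt_mask = np.asarray(gt) > 0.5
    thresholds = np.linspace(0.0, 1.0, num_thresholds, endpoint=False)
    scores = np.zeros(num_thresholds)
    for i, t in enumerate(thresholds):
        binary = pred > t
        tp = np.logical_and(binary, gt_mask).sum()
        # Guard against empty predictions or empty ground truth.
        precision = tp / max(binary.sum(), 1)
        recall = tp / max(gt_mask.sum(), 1)
        denom = beta2 * precision + recall
        if denom > 0:
            scores[i] = (1 + beta2) * precision * recall / denom
    return scores


if __name__ == "__main__":
    gt = np.array([[1, 0], [0, 1]])
    pred = np.array([[0.9, 0.2], [0.1, 0.7]])
    curve = f_measure_curve(pred, gt)
    print("MAE:", mae(pred, gt))
    print("max F:", curve.max(), "mean F:", curve.mean())
```

In this sketch, max F-measure is the maximum of the curve and mean F-measure is its average over thresholds. Benchmark numbers are then typically averaged over all images in a dataset, and some evaluation toolkits define mean F-measure with an adaptive threshold instead.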
