Our paper introduces a novel two-stage self-supervised approach for detecting co-occurring salient objects (CoSOD) in image groups without requiring segmentation annotations. Unlike existing unsupervised methods that rely solely on patch-level information (e.g., clustering patch descriptors) or on computation-heavy off-the-shelf components for CoSOD, our lightweight model leverages feature correspondences at both the patch and region levels, significantly improving prediction performance. In the first stage, we train a self-supervised network that detects co-salient regions by computing local patch-level feature correspondences across images, and we obtain segmentation predictions via confidence-based adaptive thresholding. In the second stage, we refine these intermediate segmentations by discarding the detected regions (within each image) whose averaged feature representations are dissimilar to the foreground feature representation averaged across all cross-attention maps from the first stage. Extensive experiments on three CoSOD benchmark datasets show that our self-supervised model outperforms the corresponding state-of-the-art models by a large margin (e.g., on the CoCA dataset, our model achieves a 13.7% F-measure gain over the SOTA unsupervised CoSOD model). Notably, our self-supervised model also outperforms several recent fully supervised CoSOD models on all three test datasets (e.g., on the CoCA dataset, our model achieves a 4.6% F-measure gain over a recent supervised CoSOD model).
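To make the two-stage pipeline concrete, here is a minimal sketch of our reading of the abstract, not the paper's actual implementation: stage 1 scores each patch by its best cross-image feature correspondence and binarizes with a confidence-based adaptive threshold; stage 2 drops detected regions whose mean feature is dissimilar to the group-level foreground prototype. All names, tensor shapes, and threshold values (`stage1_co_salient_maps`, `stage2_refine`, `tau`, `sim_thresh`, the region labels) are hypothetical placeholders.

```python
# Sketch of the two-stage idea under assumed ViT-style patch features.
# feats: (B, N, D) = B images per group, N patches each, D-dim features.
import torch
import torch.nn.functional as F

def stage1_co_salient_maps(feats: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """Stage 1 (assumed): patch-level cross-image correspondences.
    Each patch is scored by its strongest similarity to patches of the
    other images in the group; masks come from an adaptive threshold."""
    f = F.normalize(feats, dim=-1)                     # cosine-normalized features
    B, N, _ = f.shape
    scores = torch.zeros(B, N)
    for i in range(B):
        # similarities of image i's patches to all other images' patches
        others = torch.cat([f[j] for j in range(B) if j != i], dim=0)  # ((B-1)N, D)
        attn = f[i] @ others.T                         # (N, (B-1)N)
        scores[i] = attn.max(dim=1).values             # best correspondence per patch
    # confidence-based adaptive threshold: a fraction tau of each image's peak score
    thresh = tau * scores.max(dim=1, keepdim=True).values
    return (scores > thresh).float()                   # (B, N) intermediate masks

def stage2_refine(feats: torch.Tensor, masks: torch.Tensor,
                  region_ids: torch.Tensor, sim_thresh: float = 0.6) -> torch.Tensor:
    """Stage 2 (assumed): region-level refinement.
    Discards regions whose mean feature is dissimilar to the foreground
    prototype averaged over the whole group's stage-1 foreground patches."""
    f = F.normalize(feats, dim=-1)
    fg = masks.bool()
    prototype = F.normalize(f[fg].mean(dim=0), dim=-1)  # group foreground prototype
    refined = masks.clone()
    for i in range(f.shape[0]):
        for r in region_ids[i].unique():
            region = (region_ids[i] == r) & fg[i]
            if region.any():
                region_feat = F.normalize(f[i][region].mean(dim=0), dim=-1)
                if (region_feat @ prototype) < sim_thresh:
                    refined[i][region] = 0.0            # drop dissimilar region
    return refined

# Toy usage: a group of 4 images, 196 patches (14x14), 384-dim features,
# with dummy region labels standing in for e.g. connected components.
feats = torch.randn(4, 196, 384)
masks = stage1_co_salient_maps(feats)
regions = torch.arange(196).remainder(8).expand(4, 196)
refined = stage2_refine(feats, masks, regions)
print(masks.shape, refined.shape)  # torch.Size([4, 196]) torch.Size([4, 196])
```

In this sketch the region labels are supplied externally; in practice they would come from the stage-1 masks (e.g., connected components over the thresholded patch grid), which the abstract does not specify.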