Given a group of images, co-salient object detection (CoSOD) aims to highlight the common salient object in each image. Two factors are closely tied to success on this task: extracting the group consensus and dispersing that consensus to each image. Most previous works represent the group consensus using local features, whereas we use a hierarchical Transformer module to extract semantic-level consensus. The module therefore obtains a more comprehensive representation of the common object category and excludes interference from other objects that share local similarities with the target object. In addition, we propose a Transformer-based dispersion module that accounts for the variation of the co-salient object across scenes. It distributes the consensus to the image feature maps in an image-specific way while making full use of interactions within the group. These two modules are integrated with a ViT encoder and an FPN-like decoder to form an end-to-end trainable network, without additional branches or auxiliary losses. The proposed method is evaluated on three commonly used CoSOD datasets and achieves state-of-the-art performance.