Co-saliency detection within a single image is a common vision problem that has received little attention and has not yet been well addressed. Existing methods often use a bottom-up strategy to infer co-saliency in an image: salient regions are first detected using visual primitives such as color and shape, and are then grouped and merged into a co-saliency map. However, human vision perceives co-saliency through a complex combination of bottom-up and top-down strategies. To address this problem, this study proposes a novel end-to-end trainable network comprising a backbone net and two branch nets. The backbone net uses ground-truth masks as top-down guidance for saliency prediction, whereas the two branch nets construct triplet proposals for regional feature mapping and clustering, which makes the network sensitive to co-salient regions in a bottom-up manner. To evaluate the proposed method, we construct a new dataset of 2,019 natural images, each containing co-salient regions. Experimental results show that the proposed method achieves state-of-the-art accuracy at a running speed of 28 fps.
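The abstract does not state which loss the branch nets apply to their triplet proposals. As a minimal sketch, assuming a standard triplet margin loss over region embeddings (a swapped-in generic technique, not necessarily the authors' formulation), the idea of pulling co-salient regions together and pushing other regions away in feature space might look like the following. The function names, the `margin` value, and the toy embeddings are all hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_margin_loss(triplets, margin=0.2):
    """Mean triplet margin loss over (anchor, positive, negative) embeddings.

    Assumption: anchor and positive are embeddings of regions meant to be
    co-salient with each other; negative is an embedding of any other region.
    The margin of 0.2 is illustrative, not a value from the paper.
    """
    losses = []
    for anchor, positive, negative in triplets:
        d_pos = squared_distance(anchor, positive)
        d_neg = squared_distance(anchor, negative)
        # Penalize triplets where the negative is not at least `margin`
        # farther from the anchor than the positive is.
        losses.append(max(d_pos - d_neg + margin, 0.0))
    return sum(losses) / len(losses)


# Toy embeddings: the first triplet is already well separated,
# the second one violates the margin.
well_separated = ([0.0, 0.0], [0.0, 0.1], [3.0, 4.0])
violating = ([0.0, 0.0], [3.0, 4.0], [0.0, 0.0])
loss = triplet_margin_loss([well_separated, violating])
```

Minimizing a loss of this kind clusters the embeddings of co-salient regions, which is one plausible reading of "regional feature mapping and clustering"; in an end-to-end network it would be combined with the mask-supervised saliency loss of the backbone.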