Salient object detection (SOD) aims to segment visually prominent regions in images and serves as a foundational task for many computer vision applications. We posit that SOD can now approach fully supervised accuracy without a single pixel-level label, but only when reliable pseudo-masks are available. Revisiting the prototype-based line of work, we make two key observations: first, boundary pixels and interior pixels obey markedly different geometry; second, the global consistency enforced by optimal transport (OT) is underutilized when prototype quality is weak. To address this, we introduce POTNet, an adaptation of Prototypical Optimal Transport (POT) that replaces POT's single k-means step with an entropy-guided dual-clustering head: high-entropy pixels are organized by spectral clustering, low-entropy pixels by k-means, and the two prototype sets are then aligned by OT. This split-fuse-transport design yields sharper, part-aware pseudo-masks in a single forward pass, without handcrafted priors. These masks supervise a standard MaskFormer-style encoder-decoder, yielding AutoSOD, an end-to-end unsupervised SOD pipeline that eliminates SelfMask's offline voting while improving both accuracy and training efficiency. Extensive experiments on five benchmarks show that AutoSOD outperforms unsupervised methods by up to 26% and weakly supervised methods by up to 36% in F-measure, further narrowing the gap to fully supervised models.
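The split-fuse-transport idea described above can be sketched in a few lines: split pixels by the entropy of their soft assignments, cluster each half with a different method, then align the two prototype sets with entropic OT (Sinkhorn). This is a minimal illustration under assumed inputs and hyperparameters (the entropy threshold, cluster count, and all function names are our own), not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering

def dual_cluster_prototypes(feats, probs, k=4, tau=None, seed=0):
    """Entropy-guided dual clustering (illustrative sketch).

    feats: (N, D) pixel features; probs: (N, C) per-pixel soft
    assignments used only to compute entropy. The median-entropy
    split threshold is an assumption for this sketch.
    """
    ent = -(probs * np.log(probs + 1e-12)).sum(axis=1)   # per-pixel entropy
    if tau is None:
        tau = np.median(ent)                             # split threshold
    hi, lo = ent > tau, ent <= tau
    # High-entropy (boundary-like) pixels: spectral clustering.
    lab_hi = SpectralClustering(
        n_clusters=k, affinity="nearest_neighbors",
        n_neighbors=10, random_state=seed,
    ).fit_predict(feats[hi])
    p_hi = np.stack([feats[hi][lab_hi == c].mean(axis=0) for c in range(k)])
    # Low-entropy (interior) pixels: plain k-means.
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(feats[lo])
    return p_hi, km.cluster_centers_

def sinkhorn_align(p_a, p_b, eps=0.1, iters=200):
    """Align two prototype sets with entropic OT (Sinkhorn iterations)."""
    C = ((p_a[:, None, :] - p_b[None, :, :]) ** 2).sum(axis=-1)
    C = C / (C.max() + 1e-12)                            # normalize cost
    K = np.exp(-C / eps)
    a = np.full(len(p_a), 1.0 / len(p_a))                # uniform marginals
    b = np.full(len(p_b), 1.0 / len(p_b))
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]                   # transport plan
```

The transport plan's entries indicate how strongly each boundary-derived prototype corresponds to each interior-derived prototype; fusing the sets according to this plan is what gives the global consistency the abstract attributes to OT.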