Unsupervised semantic segmentation is the challenging task of segmenting images into semantic groups without manual annotation. Prior works have primarily focused on leveraging prior knowledge of semantic consistency or a priori concepts from self-supervised learning methods, and they often overlook the coherence property of image segments. In this paper, we demonstrate that the smoothness prior, which asserts that features close to each other in a metric space share the same semantics, can significantly simplify segmentation by casting unsupervised semantic segmentation as an energy minimization problem. Under this paradigm, we propose a novel approach called SmooSeg that harnesses self-supervised learning methods to model the closeness relationships among observations as smoothness signals. To effectively discover coherent semantic segments, we introduce a novel smoothness loss that promotes piecewise smoothness within segments while preserving discontinuities across different segments. Additionally, to further enhance segmentation quality, we design an asymmetric teacher-student style predictor that generates smoothly updated pseudo labels, facilitating an optimal fit between observations and labeling outputs. Thanks to the rich supervision cues of the smoothness prior, SmooSeg significantly outperforms STEGO in terms of pixel accuracy on three datasets: COCOStuff (+14.9%), Cityscapes (+13.0%), and Potsdam-3 (+5.7%).
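To make the energy-minimization view concrete, the following is a minimal sketch of a generic pairwise smoothness energy, not the loss defined in the paper. It assumes cosine similarity as the closeness measure and one minus the cosine similarity of soft label vectors as the disagreement penalty; the function name `smoothness_energy` and the `threshold` parameter are hypothetical and chosen only for illustration.

```python
import numpy as np

def smoothness_energy(features, labels, threshold=0.5):
    """Generic pairwise smoothness energy (illustrative assumption,
    not the exact SmooSeg loss)."""
    # Closeness: cosine similarity between L2-normalized feature vectors.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    closeness = f @ f.T
    # Disagreement: 0 when two label vectors point the same way,
    # 1 when they are orthogonal (e.g. different one-hot classes).
    p = labels / np.linalg.norm(labels, axis=1, keepdims=True)
    disagreement = 1.0 - p @ p.T
    # Pairs closer than `threshold` pay a cost for disagreeing;
    # pairs farther apart lower the energy by disagreeing, which
    # keeps discontinuities between dissimilar observations.
    return float(np.sum((closeness - threshold) * disagreement))

# Toy data: two tight feature clusters of two points each.
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
coherent = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
mixed = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])

e_coherent = smoothness_energy(feats, coherent)
e_mixed = smoothness_energy(feats, mixed)
print(e_coherent, e_mixed)
```

In this toy setting, the labeling that follows the feature clusters receives a lower energy than the labeling that splits each cluster, which is the behavior a smoothness prior is meant to reward.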