Unsupervised image semantic segmentation (UISS) aims to match low-level visual features with semantic-level representations without external supervision. In this paper, we analyze the critical properties of UISS models from the perspectives of feature alignment and feature uniformity, and we compare UISS with image-wise representation learning. Based on this analysis, we argue that existing mutual information (MI)-based methods in UISS suffer from representation collapse. To address this, we propose a robust network called Semantic Attention Network (SAN), in which a new module, Semantic Attention (SEAT), dynamically generates pixel-wise and semantic features. Experimental results on multiple semantic segmentation benchmarks show that our unsupervised segmentation framework excels at capturing semantic representations and outperforms all unpretrained methods and even several pretrained ones.