Unsupervised semantic segmentation aims to assign each pixel of an image a class label without using annotated data. It is a widely researched area because obtaining labeled datasets is expensive. While previous works in the field have demonstrated gradual improvements in segmentation performance, most of them require neural network training, which makes segmentation comparably expensive, especially on large-scale datasets. We therefore propose a lightweight clustering framework for unsupervised semantic segmentation. It builds on the observation that attention features of a self-supervised vision transformer exhibit strong foreground-background differentiability: by clustering these features into a small number of clusters, we can separate foreground and background image patches into distinct groups. In our framework, we first obtain attention features from the self-supervised vision transformer. We then extract dataset-level, category-level, and image-level masks by clustering features within the same dataset, category, and image, respectively. We further enforce multilevel clustering consistency across the three levels, which allows us to extract patch-level binary pseudo-masks. Finally, the pseudo-masks are upsampled and refined, and classes are assigned according to the CLS token of object regions. Our framework shows promise for unsupervised semantic segmentation and achieves state-of-the-art results on the PASCAL VOC and MS COCO datasets.
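The multilevel clustering step can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the framework itself: the helper names `two_means` and `multilevel_masks` are hypothetical, the attention features are assumed to be already extracted as one array per image, category ids are assumed to be given, the foreground cluster is assumed to be the one with the larger mean feature value, and "consistency" is approximated by requiring all three levels to agree. Upsampling, refinement, and CLS-based class assignment are omitted.

```python
import numpy as np

def two_means(x, n_iter=20):
    """Tiny 2-cluster k-means. Returns a boolean array per row of x.

    Assumption: True marks the cluster whose center has the larger
    feature sum, used here as a stand-in for 'foreground'.
    """
    x = np.asarray(x, dtype=float)
    # Deterministic init: rows with the smallest and largest feature sum.
    s = x.sum(axis=1)
    centers = np.stack([x[s.argmin()], x[s.argmax()]])
    for _ in range(n_iter):
        dist = ((x[:, None, :] - centers[None]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = x[labels == k].mean(axis=0)
    # Final assignment against the converged centers.
    dist = ((x[:, None, :] - centers[None]) ** 2).sum(axis=2)
    labels = dist.argmin(axis=1)
    return labels == centers.sum(axis=1).argmax()

def multilevel_masks(feats, categories):
    """feats: list of (n_patches, dim) arrays, one per image.
    categories: one category id per image (assumed to be known).
    Returns one boolean patch-level pseudo-mask per image.
    """
    sizes = [len(f) for f in feats]
    offsets = np.cumsum([0] + sizes)
    all_x = np.concatenate(feats)
    image_of_patch = np.repeat(np.arange(len(feats)), sizes)
    cats = np.asarray(categories)

    # Dataset level: cluster every patch in the dataset together.
    dataset_fg = two_means(all_x)

    # Category level: cluster patches of images sharing a category.
    category_fg = np.zeros(len(all_x), dtype=bool)
    for c in np.unique(cats):
        sel = np.isin(image_of_patch, np.flatnonzero(cats == c))
        category_fg[sel] = two_means(all_x[sel])

    # Image level: cluster the patches of each image on their own.
    image_fg = np.zeros(len(all_x), dtype=bool)
    for i in range(len(feats)):
        image_fg[offsets[i]:offsets[i + 1]] = two_means(feats[i])

    # Assumed consistency rule: keep patches that all levels call foreground.
    agreed = dataset_fg & category_fg & image_fg
    return [agreed[offsets[i]:offsets[i + 1]] for i in range(len(feats))]
```

Calling `multilevel_masks(feats, categories)` on a list of per-image feature arrays returns one boolean vector per image, which could then be reshaped to the patch grid before any upsampling. The logical AND is only one possible way to combine the levels; a voting or weighting scheme would fit the same interface.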