Unsupervised semantic segmentation aims to automatically partition images into semantically meaningful regions by identifying global categories within an image corpus without any form of annotation. Building upon recent advances in self-supervised representation learning, we focus on how to leverage the resulting large pre-trained models for the downstream task of unsupervised segmentation. We present Principal Mask Proposals (PriMaPs), which decompose images into semantically meaningful masks based on their feature representation. This allows us to realize unsupervised semantic segmentation by fitting class prototypes to PriMaPs with a stochastic expectation-maximization algorithm, PriMaPs-EM. Despite its conceptual simplicity, PriMaPs-EM leads to competitive results across various pre-trained backbone models, including DINO and DINOv2, and across datasets, such as Cityscapes, COCO-Stuff, and Potsdam-3. Importantly, PriMaPs-EM can boost results when applied orthogonally to current state-of-the-art unsupervised semantic segmentation pipelines.
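To make the idea of fitting class prototypes with a stochastic expectation-maximization procedure more concrete, the following is a minimal sketch and not the PriMaPs-EM implementation. The abstract does not specify the update rule, so everything here is an assumption made for illustration: each mask proposal is summarized by one feature vector, assignments are hard and based on cosine similarity, prototypes are updated with a momentum average over mini-batches, and the names `fit_prototypes` and `assign_classes` are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def fit_prototypes(mask_feats, num_classes, steps=200, batch_size=64,
                   momentum=0.9, seed=0):
    """Fit class prototypes to per-mask features with mini-batch hard EM.

    Assumption: mask_feats has shape (N, D), one feature vector per mask
    proposal (for example the mean backbone feature inside the mask).
    """
    rng = np.random.default_rng(seed)
    feats = l2_normalize(np.asarray(mask_feats, dtype=np.float64))
    # Initialize prototypes from randomly chosen, distinct mask features.
    init = rng.choice(len(feats), size=num_classes, replace=False)
    protos = feats[init].copy()
    for _ in range(steps):
        # Stochastic part: each step only sees a random mini-batch.
        idx = rng.choice(len(feats), size=min(batch_size, len(feats)),
                         replace=False)
        batch = feats[idx]
        # E-step: hard-assign each mask to its most similar prototype.
        assign = np.argmax(batch @ protos.T, axis=1)
        # M-step: move each prototype toward the mean of its members.
        for k in range(num_classes):
            members = batch[assign == k]
            if len(members) == 0:
                continue  # no evidence for this class in the batch
            updated = momentum * protos[k] + (1.0 - momentum) * members.mean(axis=0)
            protos[k] = l2_normalize(updated)
    return protos


def assign_classes(mask_feats, protos):
    """Label each mask with the index of its most similar prototype."""
    feats = l2_normalize(np.asarray(mask_feats, dtype=np.float64))
    return np.argmax(feats @ protos.T, axis=1)
```

Under these assumptions, `fit_prototypes(feats, num_classes=K)` returns a `(K, D)` array of unit-norm prototypes, and `assign_classes(feats, protos)` maps every mask feature to a pseudo-class index. The resulting indices are arbitrary cluster IDs; matching them to ground-truth categories for evaluation is a separate step that is not shown here.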