Weakly Supervised Semantic Segmentation (WSSS) using only image-level labels has gained significant attention due to its cost-effectiveness. The typical framework uses image-level labels as supervision to generate pixel-level pseudo-labels, which are then refined. Recently, methods based on Vision Transformers (ViT) have demonstrated superior capability in generating reliable pseudo-labels, particularly in recognizing complete object regions, compared to CNN-based methods. However, current ViT-based approaches have limitations in their use of patch embeddings, which are prone to being dominated by a few abnormal patches; moreover, many multi-stage methods are time-consuming and slow to train, and thus lack efficiency. Therefore, in this paper, we introduce a novel ViT-based WSSS method named \textit{Adaptive Patch Contrast} (APC) that significantly enhances patch embedding learning to improve segmentation performance. APC employs an Adaptive-K Pooling (AKP) layer to address the limitations of previous max-pooling-based selection methods. In addition, we propose Patch Contrastive Learning (PCL) to enhance patch embeddings and thereby further improve the final results. Furthermore, we improve upon the existing CAM-free multi-stage training framework by transforming it into an end-to-end single-stage approach, increasing training efficiency. Experimental results show that our approach is both effective and efficient, outperforming other state-of-the-art WSSS methods on the PASCAL VOC 2012 and MS COCO 2014 datasets with a shorter training time.
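To illustrate the intuition behind replacing max pooling with a top-K pooling of patch scores (the core idea the AKP layer builds on), the following is a minimal NumPy sketch. It is not the paper's implementation: the function name, the fixed `k`, and the toy scores are illustrative assumptions, and the adaptive selection of K is not shown.

```python
import numpy as np

def top_k_pooling(patch_scores, k):
    # Average the k highest patch scores for one class,
    # instead of relying on the single maximum patch.
    # (Illustrative sketch only; the adaptive choice of k
    # used by the AKP layer is not modeled here.)
    top_k = np.sort(patch_scores)[-k:]
    return float(top_k.mean())

# Toy per-patch class scores for one image and one class.
scores = np.array([0.1, 0.9, 0.8, 0.2, 0.7])

max_pool = float(scores.max())      # 0.9: the score is dominated by one patch
akp_like = top_k_pooling(scores, 3) # 0.8: averages the top patches (0.9, 0.8, 0.7)
```

Averaging several high-scoring patches makes the image-level classification score less sensitive to a single abnormal patch, which is the limitation of max-pooling selection that the abstract points to.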