Leveraging Swin Transformer for Local-to-Global Weakly Supervised Semantic Segmentation

In recent years, weakly supervised semantic segmentation with image-level labels as supervision has received significant attention in computer vision. Because these labels carry no spatial information, most existing methods generate pseudo-labels from class activation maps (CAMs) and use them to train a segmentation network in a supervised manner. Due to the localized pattern detection of Convolutional Neural Networks (CNNs), CAMs often emphasize only the most discriminative parts of an object, making it difficult to separate foreground objects from each other and from the background. Recent studies have shown that Vision Transformer (ViT) features, owing to their global view, capture the scene layout more effectively than CNN features. However, hierarchical ViTs have not been extensively explored in this field. This work explores the use of the Swin Transformer by proposing "SWTformer", which improves the accuracy of the initial seed CAMs by bringing local and global views together. SWTformer-V1 generates class probabilities and CAMs using only the patch tokens as features. SWTformer-V2 adds a multi-scale feature fusion mechanism to extract additional information and a background-aware mechanism to generate more accurate localization maps with improved cross-object discrimination. In experiments on the PascalVOC 2012 dataset, SWTformer-V1 achieves 0.98% mAP higher localization accuracy than state-of-the-art models. Relying only on the classification network, it also yields comparable performance in generating initial localization maps, on average 0.82% mIoU higher than other methods. SWTformer-V2 improves the accuracy of the generated seed CAMs by a further 5.32% mIoU, supporting the effectiveness of the local-to-global view provided by the Swin Transformer.
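To make the idea of deriving class probabilities and CAMs from patch tokens alone more concrete, the following is a minimal NumPy sketch of one common way to do it. It is an illustration under stated assumptions, not the SWTformer implementation: the function name `patch_token_cam`, the linear per-patch classifier (`weight`, `bias`), the row-major token layout, and the 7x7 grid with 768-dimensional tokens and 20 classes are all hypothetical choices made for the example.

```python
import numpy as np

def patch_token_cam(tokens, weight, bias, grid_hw):
    """Hypothetical helper: class logits and CAMs from patch tokens only.

    tokens:  (N, D) patch tokens from the last stage, assumed row-major over the grid
    weight:  (D, C) per-patch linear classifier (acts like a 1x1 convolution)
    bias:    (C,)
    grid_hw: (H, W) with H * W == N
    """
    h, w = grid_hw
    if tokens.shape[0] != h * w:
        raise ValueError("number of tokens must equal H * W")
    # Per-patch class scores.
    scores = tokens @ weight + bias              # (N, C)
    score_maps = scores.T.reshape(-1, h, w)      # (C, H, W)
    # Image-level logits: global average pooling over the spatial grid.
    logits = score_maps.mean(axis=(1, 2))        # (C,)
    # CAMs: keep positive evidence, then min-max normalise each class map.
    cams = np.maximum(score_maps, 0.0)
    lo = cams.min(axis=(1, 2), keepdims=True)
    hi = cams.max(axis=(1, 2), keepdims=True)
    cams = (cams - lo) / (hi - lo + 1e-8)
    return logits, cams

# Usage with random stand-in data (sizes are illustrative assumptions).
rng = np.random.default_rng(0)
tokens = rng.standard_normal((49, 768))
weight = rng.standard_normal((768, 20)) * 0.01
bias = np.zeros(20)
logits, cams = patch_token_cam(tokens, weight, bias, (7, 7))
print(logits.shape, cams.shape)  # (20,) (20, 7, 7)
```

In a sketch like this, the image-level logits would be trained against the image-level labels with a multi-label classification loss, and the normalized maps for the classes present in the image would serve as the seed CAMs.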
