In this work, we propose a new transformer-based regularization to better localize objects for weakly supervised semantic segmentation (WSSS). In image-level WSSS, Class Activation Maps (CAMs) are adopted to localize objects, and the resulting maps serve as pseudo segmentation labels. To address the partial activation issue of CAMs, consistency regularization is employed to keep activation intensities invariant across image augmentations. However, such methods ignore the pair-wise relations among regions within each CAM, which capture context and should also be invariant across image views. To this end, we propose a new all-pairs consistency regularization (ACR). Given a pair of augmented views, our approach regularizes the activation intensities between the two views while also ensuring that the affinity across regions within each view remains consistent. We adopt vision transformers because the self-attention mechanism naturally embeds pair-wise affinity, which lets us simply regularize the distance between the attention matrices of augmented image pairs. Additionally, we introduce a novel class-wise localization method that leverages the gradients of the class token. Our method can be seamlessly integrated into existing transformer-based WSSS methods without modifying their architectures. We evaluate our method on the PASCAL VOC and MS COCO datasets. It produces noticeably better class localization maps (67.3% mIoU on PASCAL VOC train), resulting in superior WSSS performance.
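To make the two consistency terms concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that the two views have already been spatially aligned so that patch i in one view corresponds to patch i in the other, that the distance is a mean absolute (L1) difference, and that the two terms are combined with a single weight. The abstract specifies none of these choices, and the names `mean_abs_diff`, `all_pairs_consistency_loss`, and `affinity_weight` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def mean_abs_diff(a: Matrix, b: Matrix) -> float:
    """Mean absolute (L1) difference between two equally shaped matrices."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("matrices must have the same shape")
    total = sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    count = sum(len(row) for row in a)
    return total / count


def all_pairs_consistency_loss(
    act_1: Matrix,
    act_2: Matrix,
    attn_1: Matrix,
    attn_2: Matrix,
    affinity_weight: float = 1.0,  # assumed hyperparameter, not from the abstract
) -> float:
    """Sketch of a consistency loss over two aligned augmented views.

    act_*  : class activation maps, shape (num_classes, num_patches)
    attn_* : self-attention matrices, shape (num_patches, num_patches)
    """
    # Term 1: activation intensities should match across the two views.
    activation_term = mean_abs_diff(act_1, act_2)
    # Term 2: pair-wise region affinities (attention) should match as well.
    affinity_term = mean_abs_diff(attn_1, attn_2)
    return activation_term + affinity_weight * affinity_term


# Toy example: one class, two patches per view (values are made up).
act_a = [[0.9, 0.1]]
act_b = [[0.7, 0.3]]
attn_a = [[0.8, 0.2], [0.4, 0.6]]
attn_b = [[0.6, 0.4], [0.5, 0.5]]
loss = all_pairs_consistency_loss(act_a, act_b, attn_a, attn_b)
```

In a real training loop these terms would be computed on differentiable tensors, and any geometric augmentation would have to be undone (or applied to the attention indices) before the comparison. This sketch only illustrates how the activation term and the affinity term sit side by side.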