Weakly-Supervised Semantic Segmentation (WSSS) using image-level labels typically relies on Class Activation Maps (CAM) to generate pseudo labels. Limited by the local structure perception of CNNs, CAM usually cannot identify the integral object regions. Though the recent Vision Transformer (ViT) can remedy this flaw, we observe that it also introduces an over-smoothing issue, i.e., the final patch tokens tend to become uniform. In this work, we propose Token Contrast (ToCo) to address this issue and further explore the virtue of ViT for WSSS. Firstly, motivated by the observation that intermediate layers in ViT still retain semantic diversity, we design a Patch Token Contrast module (PTC). PTC supervises the final patch tokens with pseudo token relations derived from intermediate layers, allowing them to align with the semantic regions and thus yield more accurate CAM. Secondly, to further differentiate the low-confidence regions in CAM, we devise a Class Token Contrast module (CTC), inspired by the fact that class tokens in ViT can capture high-level semantics. CTC facilitates representation consistency between uncertain local regions and global objects by contrasting their class tokens. Experiments on the PASCAL VOC and MS COCO datasets show that the proposed ToCo significantly outperforms other single-stage competitors and achieves performance comparable to state-of-the-art multi-stage methods. Code is available at https://github.com/rulixiang/ToCo.
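The idea behind PTC, supervising the final patch tokens with pseudo relations taken from an intermediate layer, can be illustrated with a minimal NumPy sketch. This is not the released implementation: the function names, the single cosine-similarity threshold, and the exact form of the loss are assumptions made only for illustration.

```python
import numpy as np


def cosine_similarity_matrix(tokens):
    """Pairwise cosine similarity between the rows of an (N, D) array."""
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.clip(norms, 1e-8, None)
    return unit @ unit.T


def patch_token_contrast_loss(intermediate_tokens, final_tokens, threshold=0.5):
    """Hypothetical PTC-style loss (illustrative, not the paper's exact formula).

    Pairs of patches whose intermediate tokens are similar are treated as
    pseudo-positive (assumed to share a semantic region); all other pairs
    are treated as pseudo-negative. The threshold value is an assumption.
    """
    pseudo_positive = cosine_similarity_matrix(intermediate_tokens) > threshold
    final_sim = cosine_similarity_matrix(final_tokens)

    # Ignore each token's similarity with itself.
    off_diagonal = ~np.eye(len(final_tokens), dtype=bool)
    pos = pseudo_positive & off_diagonal
    neg = ~pseudo_positive & off_diagonal

    # Pull pseudo-positive pairs together, push pseudo-negative pairs apart.
    pos_loss = (1.0 - final_sim[pos]).mean() if pos.any() else 0.0
    neg_loss = np.maximum(final_sim[neg], 0.0).mean() if neg.any() else 0.0
    return float(pos_loss + neg_loss)


# Four patches forming two semantic groups in the intermediate layer.
intermediate = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]])
# Over-smoothed final tokens: every patch has the same representation.
over_smoothed = np.ones((4, 2))

loss_smoothed = patch_token_contrast_loss(intermediate, over_smoothed)
loss_diverse = patch_token_contrast_loss(intermediate, intermediate)
print(loss_smoothed > loss_diverse)
```

In this toy setting, uniform final tokens incur a larger loss than final tokens that preserve the intermediate-layer grouping, which is the behavior a remedy for over-smoothing would need.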