CoBra: Complementary Branch Fusing Class and Semantic Knowledge for Robust Weakly Supervised Semantic Segmentation

Image-level Weakly Supervised Semantic Segmentation (WSSS), which derives semantically precise pseudo masks for segmentation from image-level class knowledge, remains challenging. While Class Activation Maps (CAMs) computed from CNNs have steadily contributed to the success of WSSS, the resulting activation maps often focus narrowly on class-specific parts (e.g., only the face of a person). Recent works based on vision transformers (ViTs), on the other hand, have shown promising results by using self-attention to capture semantic parts, but they fail to capture complete class-specific details (e.g., the entire body of a person, but also a dog nearby). In this work, we propose Complementary Branch (CoBra), a novel dual-branch framework consisting of two distinct architectures that provide valuable complementary knowledge of class (from the CNN) and semantics (from the ViT) to each branch. In particular, we learn a Class-Aware Projection (CAP) for the CNN branch and a Semantic-Aware Projection (SAP) for the ViT branch to explicitly fuse their complementary knowledge and to enable a new type of additional patch-level supervision. Through CoBra, our model fuses the complementary outputs of the CNN and the ViT to create robust pseudo masks that effectively integrate both class and semantic information. Extensive experiments on the PASCAL VOC 2012 dataset investigate, qualitatively and quantitatively, how the CNN and the ViT complement each other, and show state-of-the-art WSSS results. These results cover not only the masks generated by our model, but also the segmentation results obtained by using these masks as pseudo labels.
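For readers unfamiliar with the CAM baseline mentioned above, the following is a minimal sketch of the standard CAM formulation, in which the activation map for a class is the classifier-weighted sum of the final convolutional feature maps. It is not CoBra's CAP or SAP module, whose details are not given in this abstract. The function names, the array layout, and the 0.3 threshold are illustrative assumptions.

```python
import numpy as np


def class_activation_map(features, fc_weights, class_idx):
    """Standard CAM for one class (illustrative helper, not from the paper).

    features:   (K, H, W) final conv feature maps of a CNN.
    fc_weights: (C, K) weights of the linear classifier that follows
                global average pooling.
    """
    # Weighted sum over the K channels gives an (H, W) map.
    cam = np.tensordot(fc_weights[class_idx], features, axes=([0], [0]))
    cam = np.maximum(cam, 0.0)  # keep positive evidence only
    peak = cam.max()
    return cam / peak if peak > 0 else cam  # scale to [0, 1]


def pseudo_mask(features, fc_weights, present_classes, threshold=0.3):
    """Turn CAMs of the image-level labels into a pseudo mask.

    Label 0 is background; class c is stored as c + 1.
    The threshold value is an assumption chosen for illustration.
    """
    cams = np.stack(
        [class_activation_map(features, fc_weights, c) for c in present_classes]
    )
    best = cams.argmax(axis=0)                      # strongest class per pixel
    mask = np.asarray(present_classes)[best] + 1    # shift so 0 means background
    mask[cams.max(axis=0) < threshold] = 0          # weak activations -> background
    return mask
```

Because such maps are driven by the most discriminative channels, they tend to highlight only part of an object, which is the limitation the abstract attributes to CNN-based CAMs.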
