Leveraging semantically precise pseudo masks derived from image-level class knowledge for segmentation, known as image-level Weakly Supervised Semantic Segmentation (WSSS), remains challenging. While Class Activation Maps (CAMs) from CNNs have steadily contributed to the success of WSSS, the resulting activation maps often focus narrowly on discriminative class-specific parts (e.g., only the face of a person). Conversely, recent works based on Vision Transformers (ViT) have shown promising results: their self-attention mechanism captures semantic parts well, but they fail to capture complete class-specific details (e.g., the entire body of a person, but with a nearby dog included as well). In this work, we propose Complementary Branch (CoBra), a novel dual-branch framework consisting of two distinct architectures that supply each branch with valuable complementary knowledge: class knowledge from the CNN and semantic knowledge from the ViT. In particular, we learn a Class-Aware Projection (CAP) for the CNN branch and a Semantic-Aware Projection (SAP) for the ViT branch to explicitly fuse their complementary knowledge and facilitate a new form of additional patch-level supervision. Through CoBra, our model fuses the complementary outputs of the CNN and the ViT to create robust pseudo masks that effectively integrate both class and semantic information. Extensive experiments on the PASCAL VOC 2012 dataset qualitatively and quantitatively investigate how the CNN and the ViT complement each other, yielding state-of-the-art WSSS results. This holds not only for the masks generated by our model, but also for the segmentation results obtained by using these masks as pseudo labels.
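To make the complementary-fusion idea concrete, the sketch below shows one generic way class cues (CAM-like scores from a CNN) can be combined with semantic cues (a patch-affinity matrix, e.g., derived from ViT self-attention): class scores are propagated over semantically similar patches to produce patch-level pseudo labels. This is a minimal illustration of the general principle, not the paper's CAP/SAP projections; the function name, shapes, and the random-walk-style propagation are illustrative assumptions.

```python
import numpy as np

def refine_cam_with_affinity(cam, affinity, n_iters=2):
    """Propagate class scores over a patch-affinity graph.

    cam:      (C, N) per-patch class scores (class cue, CNN-like)
    affinity: (N, N) non-negative patch similarity (semantic cue, ViT-like)
    Returns refined (C, N) scores. Illustrative sketch only.
    """
    # Row-normalize the affinity into a transition matrix.
    T = affinity / affinity.sum(axis=1, keepdims=True)
    out = cam.copy()
    for _ in range(n_iters):
        # Each patch aggregates scores from semantically similar patches.
        out = out @ T.T
    return out

# Toy example: 2 classes, 4 patches; patches 0-1 are mutually
# similar, as are patches 2-3. The CAM fires only on one patch
# per class; propagation spreads the score to its semantic peers.
cam = np.array([[1.0, 0.0, 0.0, 0.0],
                [0.0, 0.0, 0.0, 1.0]])
aff = np.array([[1.0, 0.9, 0.0, 0.0],
                [0.9, 1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 0.9],
                [0.0, 0.0, 0.9, 1.0]])
refined = refine_cam_with_affinity(cam, aff)
pseudo_mask = refined.argmax(axis=0)  # patch-level pseudo labels
print(pseudo_mask)  # → [0 0 1 1]
```

In this toy case the narrow class cue on patches 0 and 3 expands to the full semantic groups {0, 1} and {2, 3}, mirroring how semantic affinity can complete an under-activated CAM while the class scores keep the semantic groups correctly labeled.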