Weakly Supervised Semantic Segmentation (WSSS), which relies only on image-level labels, has attracted significant attention for its cost-effectiveness and scalability. Existing methods mainly enhance inter-class distinctions and employ data augmentation to mitigate semantic ambiguity and reduce spurious activations. However, they often neglect the complex contextual dependencies among image patches, resulting in incomplete local representations and limited segmentation accuracy. To address these issues, we propose the Context Patch Fusion with Class Token Enhancement (CPF-CTE) framework, which exploits contextual relations among patches to enrich feature representations and improve segmentation. At its core, the Contextual-Fusion Bidirectional Long Short-Term Memory (CF-BiLSTM) module captures spatial dependencies between patches and enables bidirectional information flow, yielding a more comprehensive understanding of spatial correlations. This strengthens feature learning and segmentation robustness. Moreover, we introduce learnable class tokens that dynamically encode and refine class-specific semantics, enhancing discriminative capability. By effectively integrating spatial and semantic cues, CPF-CTE produces richer and more accurate representations of image content. Extensive experiments on PASCAL VOC 2012 and MS COCO 2014 validate that CPF-CTE consistently surpasses prior WSSS methods.
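The bidirectional information flow that CF-BiLSTM performs over patch features can be illustrated with a toy recurrence. The sketch below is an assumption-laden simplification: the actual module uses full LSTM gating over high-dimensional patch embeddings, whereas here each patch is a single scalar, the cell is a plain tanh recurrence, and the weights `w_x`, `w_h` are arbitrary illustrative values. It shows only the core idea that each patch ends up paired with context accumulated from both scan directions.

```python
import math

def bidirectional_pass(patch_feats, w_x=0.5, w_h=0.3):
    """Toy bidirectional recurrence over a 1-D sequence of patch features.

    Illustrative only: the real CF-BiLSTM applies gated LSTM cells to
    high-dimensional patch embeddings; here each patch is a scalar and
    the cell is a bare tanh recurrence with hypothetical weights.
    """
    def one_direction(seq):
        h, states = 0.0, []
        for x in seq:
            # hidden state h carries context from previously seen patches
            h = math.tanh(w_x * x + w_h * h)
            states.append(h)
        return states

    fwd = one_direction(patch_feats)              # left-to-right context
    bwd = one_direction(patch_feats[::-1])[::-1]  # right-to-left context
    # fuse: every patch gets hidden states from both directions
    return list(zip(fwd, bwd))

fused = bidirectional_pass([0.2, 0.9, 0.1, 0.7])
```

After the pass, `fused[i]` holds a (forward, backward) pair for patch `i`, so interior patches are represented with knowledge of neighbors on both sides, which is the spatial-dependency property the abstract attributes to CF-BiLSTM.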