Semantic segmentation assigns a label to every pixel in an image, a critical yet challenging task in computer vision. Convolutional methods capture local dependencies well but struggle with long-range relationships, while Vision Transformers (ViTs) excel at capturing global context yet are hindered by high computational demands, especially for high-resolution inputs. Most research optimizes the encoder architecture, leaving the bottleneck underexplored, even though it is a key area for enhancing both performance and efficiency. We propose ContextFormer, a hybrid framework that leverages the strengths of CNNs and ViTs in the bottleneck to balance efficiency, accuracy, and robustness for real-time semantic segmentation. The framework's efficiency is driven by three synergistic modules: the Token Pyramid Extraction Module (TPEM) for hierarchical multi-scale representation, the Transformer and Modulating DepthwiseConv (Trans-MDC) block for dynamic scale-aware feature modeling, and the Feature Merging Module (FMM) for robust integration with enhanced spatial and contextual consistency. Extensive experiments on the ADE20K, Pascal Context, Cityscapes, and COCO-Stuff datasets show that ContextFormer significantly outperforms existing models, achieving state-of-the-art mIoU scores and setting a new benchmark for efficiency and performance. The code will be made publicly available.