P2AT: Pyramid Pooling Axial Transformer for Real-time Semantic Segmentation

Transformer-based models have recently achieved promising results in various vision tasks owing to their ability to model long-range dependencies. However, transformers are computationally expensive, which limits their use in real-time tasks such as autonomous driving. In addition, efficient selection and fusion of local and global features is vital for accurate dense prediction, especially in driving-scene understanding. In this paper, we propose a real-time semantic segmentation architecture named Pyramid Pooling Axial Transformer (P2AT). P2AT takes a coarse feature from the CNN encoder to produce scale-aware contextual features, which are then combined with a multi-level feature aggregation scheme to produce enhanced contextual features. Specifically, we introduce a pyramid pooling axial transformer to capture intricate spatial and channel dependencies, leading to improved semantic segmentation performance. Then, we design a Bidirectional Fusion (BiF) module to combine semantic information at different levels. Meanwhile, a Global Context Enhancer is introduced to compensate for the inadequacy of concatenating different semantic levels. Finally, a decoder block is proposed to help maintain a larger receptive field. We evaluate P2AT variants on three challenging scene-understanding datasets. In particular, our P2AT variants achieve state-of-the-art results on the CamVid dataset: 80.5%, 81.0%, and 81.1% for P2AT-S, P2AT-M, and P2AT-L, respectively. Furthermore, our experiments on Cityscapes and Pascal VOC 2012 demonstrate the efficiency of the proposed architecture, with P2AT-M achieving 78.7% on Cityscapes. The source code will be available at
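
To make the idea of combining pyramid pooling with axial attention concrete, the following is a minimal NumPy sketch under stated assumptions. It is not the P2AT module itself, whose internals the abstract does not specify. The function names (`adaptive_avg_pool`, `axial_attention`, `pyramid_axial_block`), the pool sizes, the shared projection matrices, and the averaging used for fusion are all hypothetical choices made for illustration. The sketch models only spatial dependencies and omits the channel dependencies, BiF, Global Context Enhancer, and decoder mentioned above.

```python
import numpy as np


def adaptive_avg_pool(x, out_size):
    """Average-pool a (C, H, W) map down to (C, out_size, out_size)."""
    c, h, w = x.shape
    out = np.zeros((c, out_size, out_size))
    for i in range(out_size):
        # Bin edges: floor for the start, ceil for the end.
        h0, h1 = (i * h) // out_size, -((-(i + 1) * h) // out_size)
        for j in range(out_size):
            w0, w1 = (j * w) // out_size, -((-(j + 1) * w) // out_size)
            out[:, i, j] = x[:, h0:h1, w0:w1].mean(axis=(1, 2))
    return out


def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def axial_attention(x, wq, wk, wv, axis):
    """Self-attention restricted to one spatial axis of a (C, H, W) map.

    axis=1 attends along height (within each column),
    axis=2 attends along width (within each row).
    """
    t = np.moveaxis(x, 0, -1)               # (H, W, C)
    if axis == 1:
        t = t.transpose(1, 0, 2)            # (W, H, C): columns as sequences
    q, k, v = t @ wq, t @ wk, t @ wv
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    out = softmax(scores, axis=-1) @ v      # (batch of lines, L, C)
    if axis == 1:
        out = out.transpose(1, 0, 2)        # back to (H, W, C)
    return np.moveaxis(out, -1, 0)          # (C, H, W)


def upsample_nearest(x, h, w):
    """Nearest-neighbour upsampling of a (C, h', w') map to (C, h, w)."""
    _, sh, sw = x.shape
    rows = (np.arange(h) * sh) // h
    cols = (np.arange(w) * sw) // w
    return x[:, rows][:, :, cols]


def pyramid_axial_block(x, wq, wk, wv, pool_sizes=(1, 2, 4)):
    """Hypothetical block: pool to several grids, apply height- then
    width-axial attention at each scale, upsample, and average with x."""
    _, h, w = x.shape
    fused = x.astype(np.float64).copy()
    for s in pool_sizes:
        p = adaptive_avg_pool(x, s)
        p = axial_attention(p, wq, wk, wv, axis=1)
        p = axial_attention(p, wq, wk, wv, axis=2)
        fused += upsample_nearest(p, h, w)
    return fused / (len(pool_sizes) + 1)


rng = np.random.default_rng(0)
C, H, W = 8, 16, 32
feat = rng.standard_normal((C, H, W))       # stand-in for a coarse CNN feature
wq, wk, wv = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))
out = pyramid_axial_block(feat, wq, wk, wv)
print(out.shape)  # (8, 16, 32)
```

Attending along one axis at a time costs on the order of H·W·(H + W) score entries per map rather than (H·W)² for full self-attention, and applying it to pooled grids shrinks H and W further. This is one plausible reason such a design could suit real-time use, though the abstract itself does not give a complexity analysis.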
