The recently proposed MaskFormer offers a fresh perspective on semantic segmentation: it shifts from the popular pixel-level classification paradigm to mask-level classification. In essence, it generates paired class probabilities and masks corresponding to category segments and combines them during inference to produce the segmentation map. In our study, we find that a per-mask classification decoder on top of a single-scale feature is not effective enough to extract reliable probabilities or masks. To mine rich semantic information across the feature pyramid, we propose the Pyramid Fusion Transformer (PFT), a transformer-based model for per-mask semantic segmentation with multi-scale features. The proposed transformer decoder performs cross-attention between the learnable queries and the spatial features at each level of the feature pyramid in parallel, and uses cross-scale inter-query attention to exchange complementary information. Extensive experiments on three widely used semantic segmentation datasets verify the effectiveness of the proposed method, which achieves competitive performance. In particular, on the ADE20K validation set, our result with a Swin-B backbone surpasses that of MaskFormer with a much larger Swin-L backbone in both single-scale and multi-scale inference, achieving 54.1 mIoU and 55.7 mIoU respectively. Using a Swin-L backbone, we achieve 56.1 mIoU in single-scale and 57.4 mIoU in multi-scale inference, obtaining state-of-the-art performance on the dataset.
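To make the "paired probabilities and masks" idea concrete, the following is a minimal pure-Python sketch of how per-mask predictions can be combined into a segmentation map at inference time. It assumes the semantic inference rule described for MaskFormer: each query's class probabilities (with the "no object" entry dropped) are weighted by that query's sigmoid mask, summed over queries, and the highest-scoring class is taken per pixel. The function names, the nested-list tensor layout, and the toy inputs are hypothetical and chosen for illustration only; this does not reproduce the PFT decoder itself.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(value):
    """Logistic function mapping a mask logit to [0, 1]."""
    return 1.0 / (1.0 + math.exp(-value))


def combine_mask_predictions(class_logits, mask_logits):
    """Combine per-query class scores and masks into a label map.

    class_logits: N lists of K + 1 scores; the last entry is assumed
                  to be the "no object" class.
    mask_logits:  N masks, each an H x W nested list of logits.
    Returns an H x W nested list of class indices in [0, K).
    """
    num_classes = len(class_logits[0]) - 1
    height = len(mask_logits[0])
    width = len(mask_logits[0][0])

    # Per-query class probabilities, with the "no object" entry dropped.
    probs = [softmax(row)[:num_classes] for row in class_logits]

    segmentation = []
    for y in range(height):
        row = []
        for x in range(width):
            # Sum probability-weighted mask values over all queries.
            scores = [0.0] * num_classes
            for query, class_probs in enumerate(probs):
                mask_value = sigmoid(mask_logits[query][y][x])
                for k in range(num_classes):
                    scores[k] += class_probs[k] * mask_value
            # Assign the pixel to the highest-scoring class.
            row.append(max(range(num_classes), key=scores.__getitem__))
        segmentation.append(row)
    return segmentation


# Toy example: 2 queries, 2 classes plus "no object", a 1 x 2 image.
toy_class_logits = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0]]
toy_mask_logits = [[[5.0, -5.0]], [[-5.0, 5.0]]]
toy_result = combine_mask_predictions(toy_class_logits, toy_mask_logits)
```

In the toy example, the first query is confident in class 0 and its mask covers the left pixel, while the second query is confident in class 1 and covers the right pixel, so the left pixel is labeled 0 and the right pixel 1. A practical implementation would express the same sum as a single tensor contraction over the query dimension rather than nested loops.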