Recent transformer-based architectures have shown impressive results in image segmentation. Thanks to their flexibility, they achieve outstanding performance on multiple segmentation tasks, such as semantic and panoptic segmentation, under a single unified framework. To reach this level of performance, however, these architectures rely on computationally intensive operations and require substantial computational resources, which are often unavailable, especially on edge devices. To fill this gap, we propose Prototype-based Efficient MaskFormer (PEM), an efficient transformer-based architecture that can operate on multiple segmentation tasks. PEM introduces a novel prototype-based cross-attention that leverages the redundancy of visual features to restrict computation and improve efficiency without harming performance. In addition, PEM incorporates an efficient multi-scale feature pyramid network that extracts features with high semantic content efficiently by combining deformable convolutions and context-based self-modulation. We benchmark PEM on two tasks, semantic and panoptic segmentation, and on two datasets, Cityscapes and ADE20K. PEM demonstrates outstanding performance on every task and dataset, outperforming task-specific architectures while matching or even surpassing computationally expensive baselines.