Recent transformer-based architectures have shown impressive results in image segmentation. Thanks to their flexibility, they achieve strong performance on multiple segmentation tasks, such as semantic and panoptic segmentation, under a single unified framework. To reach this performance, however, these architectures rely on computationally intensive operations and require substantial computational resources, which are often unavailable, especially on edge devices. To fill this gap, we propose Prototype-based Efficient MaskFormer (PEM), an efficient transformer-based architecture that can operate in multiple segmentation tasks. PEM introduces a novel prototype-based cross-attention that leverages the redundancy of visual features to restrict computation and improve efficiency without harming performance. In addition, PEM introduces an efficient multi-scale feature pyramid network that extracts features with high semantic content efficiently by combining deformable convolutions with context-based self-modulation. We benchmark PEM on two tasks, semantic and panoptic segmentation, on two datasets, Cityscapes and ADE20K. PEM demonstrates outstanding performance on every task and dataset, outperforming task-specific architectures while being comparable to, or even better than, computationally expensive baselines.
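To make the idea of prototype-based cross-attention concrete, the following is a minimal sketch of one possible reading, not the PEM implementation. The abstract does not specify how prototypes are chosen or how they update the queries, so everything here is an assumption: the function names (`select_prototypes`, `prototype_update`), the dot-product similarity, the choice of a single most-similar feature per query, and the residual element-wise update are all hypothetical.

```python
# Illustrative sketch only: each query interacts with ONE representative
# feature (its "prototype") instead of a weighted sum over all features.
# All names and design choices are assumptions, not taken from the paper.

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def select_prototypes(queries, features):
    """For each query, pick the feature with the highest dot-product similarity."""
    return [max(features, key=lambda f: dot(q, f)) for q in queries]


def prototype_update(queries, features):
    """Update each query using only its selected prototype.

    Assumed update rule: residual plus element-wise product of the query
    and its prototype. No softmax or weighted sum over all features is used.
    """
    prototypes = select_prototypes(queries, features)
    return [
        [qi + qi * pi for qi, pi in zip(q, p)]
        for q, p in zip(queries, prototypes)
    ]


if __name__ == "__main__":
    queries = [[1.0, 0.0], [0.0, 1.0]]
    features = [[0.75, 0.25], [0.25, 0.75], [0.5, 0.5]]
    print(prototype_update(queries, features))
```

In this sketch the similarity between every query and every feature is still computed, but the update step touches a single feature per query. This is one way that redundancy among visual features could be used to restrict computation.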