Recent transformer-based architectures have shown impressive results in image segmentation. Thanks to their flexibility, they achieve outstanding performance on multiple segmentation tasks, such as semantic and panoptic segmentation, under a single unified framework. However, to reach this level of performance, these architectures rely on intensive operations and require substantial computational resources, which are often unavailable, especially on edge devices. To fill this gap, we propose Prototype-based Efficient MaskFormer (PEM), an efficient transformer-based architecture that can operate on multiple segmentation tasks. PEM introduces a novel prototype-based cross-attention that leverages the redundancy of visual features to restrict computation and improve efficiency without harming performance. In addition, PEM introduces an efficient multi-scale feature pyramid network that extracts features with high semantic content efficiently, thanks to the combination of deformable convolutions and context-based self-modulation. We benchmark the proposed PEM architecture on two tasks, semantic and panoptic segmentation, each evaluated on two datasets, Cityscapes and ADE20K. PEM demonstrates outstanding performance on every task and dataset, outperforming task-specific architectures while being comparable to, or even better than, computationally expensive baselines.