Since the introduction of Vision Transformers, the landscape of many computer vision tasks (e.g., semantic segmentation), long dominated by CNNs, has been significantly revolutionized. However, their computational cost and memory requirements make these methods unsuitable for mobile devices, especially for the high-resolution per-pixel semantic segmentation task. In this paper, we introduce a new method, squeeze-enhanced Axial TransFormer (SeaFormer), for mobile semantic segmentation. Specifically, we design a generic attention block characterized by the formulation of squeeze axial attention and detail enhancement. It can further be used to create a family of backbone architectures with superior cost-effectiveness. Coupled with a light segmentation head, we achieve the best trade-off between segmentation accuracy and latency on ARM-based mobile devices on the ADE20K and Cityscapes datasets. Critically, we beat both the mobile-friendly rivals and Transformer-based counterparts with better performance and lower latency, without bells and whistles. Beyond semantic segmentation, we further apply the proposed SeaFormer architecture to the image classification problem, demonstrating its potential to serve as a versatile mobile-friendly backbone.
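The abstract names the attention block's two ingredients, squeezing along the axes and detail enhancement, but does not give their formulation. The NumPy sketch below is one plausible reading, not the authors' implementation. It assumes that "squeeze" means mean pooling along each spatial axis, that a single set of projection weights is shared by both axes, and that "detail enhancement" is a per-pixel sigmoid gate computed from the unsqueezed features. The function names and the weights `wq`, `wk`, `wv`, `wd` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(seq, wq, wk, wv):
    """Single-head self-attention over a sequence of shape (L, C)."""
    q, k, v = seq @ wq, seq @ wk, seq @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (L, L) attention logits
    return softmax(scores, axis=-1) @ v      # (L, C)


def squeeze_axial_attention(x, wq, wk, wv, wd):
    """Hypothetical squeeze-then-attend block for a feature map x of shape (H, W, C)."""
    # Squeeze (assumption: mean pooling): collapse one spatial axis at a time.
    rows = x.mean(axis=1)  # (H, C), one token per row
    cols = x.mean(axis=0)  # (W, C), one token per column

    # Attention runs on H tokens and on W tokens instead of H * W tokens.
    row_out = attend(rows, wq, wk, wv)  # (H, C)
    col_out = attend(cols, wq, wk, wv)  # (W, C)

    # Broadcast both axial results back onto the full (H, W, C) grid.
    global_ctx = row_out[:, None, :] + col_out[None, :, :]

    # Detail enhancement (assumption): a per-pixel gate from the unsqueezed
    # features reintroduces the local detail lost by pooling.
    gate = 1.0 / (1.0 + np.exp(-(x @ wd)))  # (H, W, C), values in (0, 1)
    return gate * global_ctx


rng = np.random.default_rng(0)
H, W, C = 4, 6, 8
x = rng.standard_normal((H, W, C))
wq, wk, wv, wd = (rng.standard_normal((C, C)) for _ in range(4))
out = squeeze_axial_attention(x, wq, wk, wv, wd)
print(out.shape)
```

Under these assumptions, the appeal for mobile inference is the size of the attention maps: for a 64 x 64 feature map, full self-attention over all 4,096 positions needs 4096^2 = 16,777,216 score entries, while the two squeezed sequences need only 64^2 + 64^2 = 8,192.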