Since the introduction of Vision Transformers, the landscape of many computer vision tasks (e.g., semantic segmentation), long dominated by CNNs, has been significantly reshaped. However, the computational cost and memory requirements of these methods make them unsuitable for mobile devices, especially for the high-resolution, per-pixel semantic segmentation task. In this paper, we introduce a new method, squeeze-enhanced Axial TransFormer (SeaFormer), for mobile semantic segmentation. Specifically, we design a generic attention block characterized by the formulation of squeeze Axial and detail enhancement. This block can in turn be used to build a family of backbone architectures with superior cost-effectiveness. Coupled with a light segmentation head, we achieve the best trade-off between segmentation accuracy and latency on ARM-based mobile devices on the ADE20K and Cityscapes datasets. Critically, we outperform both mobile-friendly rivals and Transformer-based counterparts, with better accuracy and lower latency, without bells and whistles. Beyond semantic segmentation, we further apply the proposed SeaFormer architecture to the image classification problem, demonstrating its potential to serve as a versatile mobile-friendly backbone.