Since the introduction of Vision Transformers, the landscape of many computer vision tasks (e.g., semantic segmentation), long overwhelmingly dominated by CNNs, has changed significantly. However, their computational cost and memory requirements make these methods unsuitable for mobile devices, especially for the high-resolution per-pixel semantic segmentation task. In this paper, we introduce a new method, the squeeze-enhanced Axial TransFormer (SeaFormer), for mobile semantic segmentation. Specifically, we design a generic attention block characterized by the formulation of squeeze Axial and detail enhancement. This block can in turn be used to build a family of backbone architectures with superior cost-effectiveness. Coupled with a light segmentation head, SeaFormer achieves the best trade-off between segmentation accuracy and latency on ARM-based mobile devices on the ADE20K and Cityscapes datasets. Critically, we beat both mobile-friendly rivals and Transformer-based counterparts with better performance and lower latency, without bells and whistles. Beyond semantic segmentation, we further apply the proposed SeaFormer architecture to the image classification problem, demonstrating its potential as a versatile mobile-friendly backbone.