Vision Transformers have achieved widespread success across a range of vision tasks. However, these Transformers often incur heavy computational costs to reach high performance, which makes them burdensome to deploy on resource-constrained devices. To alleviate this issue, we draw lessons from depthwise separable convolution and follow its design principle to build an efficient Transformer backbone, the Separable Vision Transformer, abbreviated as SepViT. SepViT carries out local-global information interaction within and among windows in sequential order via a depthwise separable self-attention. The novel window token embedding and grouped self-attention are employed to compute the attention relationship among windows with negligible cost and to establish long-range visual interactions across multiple windows, respectively. Extensive experiments on general-purpose vision benchmarks demonstrate that SepViT achieves a state-of-the-art trade-off between performance and latency. In particular, SepViT achieves 84.2% top-1 accuracy on ImageNet-1K classification while reducing latency by 40% compared to models with similar accuracy (e.g., CSWin). Furthermore, SepViT achieves 51.0% mIoU on the ADE20K semantic segmentation task, 47.9 AP on the RetinaNet-based COCO detection task, and 49.4 box AP and 44.6 mask AP on the Mask R-CNN-based COCO object detection and instance segmentation tasks.
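To give a rough sense of why attention among windows can be cheap, the sketch below counts query-key pairs (attention scores) per head for three schemes: global self-attention, attention within each window, and attention among window tokens. It is a back-of-the-envelope illustration, not the paper's complexity analysis. The function name `attention_pair_counts`, the 56×56 feature map, the 7×7 window, and the assumption of exactly one window token per window are all hypothetical choices made for this example; the count also ignores projection layers and the cost of applying the window-level attention back to the features.

```python
def attention_pair_counts(height, width, window):
    """Count query-key pairs (attention scores) per head for three schemes.

    Illustrative only: sizes and the one-token-per-window assumption
    are not taken from the abstract.
    """
    if height % window or width % window:
        raise ValueError("feature map must tile evenly into windows")

    tokens = height * width
    num_windows = (height // window) * (width // window)
    tokens_per_window = window * window

    # Global self-attention: every token attends to every token.
    global_pairs = tokens ** 2

    # Within-window step: each window attends over its own pixel tokens
    # plus one extra window token (assumed: one token per window).
    within_pairs = num_windows * (tokens_per_window + 1) ** 2

    # Among-window step: the window tokens attend to each other.
    among_pairs = num_windows ** 2

    return global_pairs, within_pairs, among_pairs


global_pairs, within_pairs, among_pairs = attention_pair_counts(56, 56, 7)
print(global_pairs, within_pairs, among_pairs)
```

Under these assumed sizes, the among-window step adds only a small number of attention scores on top of the within-window step, and both together stay far below the global count.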