UniFormer: Unifying Convolution and Self-attention for Visual Recognition

From arXiv: 18 pages, 10 figures, 23 tables. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

Learning discriminative representations from images and videos is challenging because these visual data contain large local redundancy and complex global dependency. Convolutional neural networks (CNNs) and vision transformers (ViTs) have been the two dominant frameworks in the past few years. CNNs can efficiently reduce local redundancy by convolving within a small neighborhood, but their limited receptive field makes it hard to capture global dependency. ViTs, in contrast, can effectively capture long-range dependency via self-attention, but blind similarity comparisons among all tokens lead to high redundancy. To resolve these problems, we propose a novel Unified transFormer (UniFormer), which seamlessly integrates the merits of convolution and self-attention in a concise transformer format. Unlike typical transformer blocks, the relation aggregators in our UniFormer block are equipped with local token affinity in shallow layers and global token affinity in deep layers, allowing the model to tackle both redundancy and dependency for efficient and effective representation learning. Finally, we flexibly stack our UniFormer blocks into a new, powerful backbone and adopt it for various vision tasks, from the image to the video domain and from classification to dense prediction. Without any extra training data, our UniFormer achieves 86.3% top-1 accuracy on ImageNet-1K classification. With only ImageNet-1K pre-training, it achieves state-of-the-art performance on a broad range of downstream tasks: 82.9%/84.8% top-1 accuracy on Kinetics-400/600, 60.9%/71.2% top-1 accuracy on Sth-Sth V1/V2 video classification, 53.8 box AP and 46.4 mask AP on COCO object detection, 50.8 mIoU on ADE20K semantic segmentation, and 77.4 AP on COCO pose estimation. We further build an efficient UniFormer with 2-4x higher throughput. Code is available at https://github.com/Sense-X/UniFormer.
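
The distinction between local and global token affinity can be illustrated with a toy example. The sketch below is not taken from the UniFormer code base. It is a minimal pure-Python illustration on a 1D token sequence, under the assumption that a relation aggregator can be read as "affinity matrix times tokens". The function names (`local_affinity`, `global_affinity`, `aggregate`), the fixed window weights, and the omission of query/key/value projections are all simplifications introduced here for illustration.

```python
import math


def local_affinity(num_tokens, window, weights):
    """Affinity that is nonzero only inside a small neighborhood.

    weights[k] is a hypothetical (in practice learned) weight for the
    relative offset k - window // 2. It does not depend on token content,
    which is what makes this behave like a convolution.
    """
    half = window // 2
    affinity = [[0.0] * num_tokens for _ in range(num_tokens)]
    for i in range(num_tokens):
        for k, w in enumerate(weights):
            j = i + k - half
            if 0 <= j < num_tokens:  # skip neighbors outside the sequence
                affinity[i][j] = w
    return affinity


def global_affinity(tokens):
    """Content-dependent affinity between every pair of tokens.

    Scaled dot-product similarity followed by a row-wise softmax, with the
    usual query/key projections left out to keep the sketch short.
    """
    dim = len(tokens[0])
    affinity = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        affinity.append([e / total for e in exps])
    return affinity


def aggregate(affinity, tokens):
    """Relation aggregation: out[i] = sum_j affinity[i][j] * tokens[j]."""
    dim = len(tokens[0])
    return [
        [sum(row[j] * tokens[j][d] for j in range(len(tokens))) for d in range(dim)]
        for row in affinity
    ]


# Four toy tokens with two channels each.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]

# "Shallow layer" style: each token only mixes with its direct neighbors.
local_out = aggregate(local_affinity(len(tokens), 3, [0.25, 0.5, 0.25]), tokens)

# "Deep layer" style: each token compares itself with every other token.
global_out = aggregate(global_affinity(tokens), tokens)
```

In this reading, the local variant costs a fixed number of operations per token regardless of sequence length, while the global variant compares all token pairs. That is one way to see why local affinity would suit the early, high-resolution stages and global affinity the later, lower-resolution ones.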
