Neural networks for visual content understanding have recently evolved from convolutional neural networks (CNNs) to transformers. The former (CNN) relies on small-windowed kernels to capture regional cues, demonstrating solid local expressiveness. In contrast, the latter (transformer) establishes long-range global connections between localities for holistic learning. Inspired by this complementary nature, there is growing interest in designing hybrid models that make the best use of each technique. Current hybrids merely use convolutions as simple approximations of linear projections, or juxtapose a convolution branch with attention, without considering the relative importance of local and global modeling. To tackle this, we propose a new hybrid named Adaptive Split-Fusion Transformer (ASF-former), which treats the convolutional and attention branches differently with adaptive weights. Specifically, an ASF-former encoder splits the feature channels into two equal halves to fit the dual-path inputs. The outputs of the two paths are then fused with weighting scalars calculated from visual cues. For efficiency, we also design the convolutional path to be compact. Extensive experiments on standard benchmarks, such as ImageNet-1K, CIFAR-10, and CIFAR-100, show that our ASF-former outperforms its CNN and transformer counterparts as well as pioneering hybrids in terms of accuracy (83.9% on ImageNet-1K) under similar conditions (12.9G MACs/56.7M Params, without large-scale pre-training). The code is available at: https://github.com/szx503045266/ASF-former.
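To make the split, process, and fuse pattern concrete, here is a minimal framework-free sketch. It is not the ASF-former implementation from the linked repository, and several details are assumptions: a depthwise 1-D convolution over the token sequence stands in for the compact convolutional path; single-head self-attention with identity Q/K/V projections stands in for the attention path; and a single sigmoid gate over globally pooled outputs produces the weight α. The two scaled halves are then concatenated back to full channel width. All function names (`asf_block`, `fusion_weight`, and so on) and the gate parameters `gate_w`/`gate_b` are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]  # N tokens x C channels


def split_channels(x: Matrix) -> Tuple[Matrix, Matrix]:
    """Split the channel dimension into two equal halves (conv half, attention half)."""
    c = len(x[0])
    assert c % 2 == 0, "channel count must be even to split into halves"
    h = c // 2
    return [row[:h] for row in x], [row[h:] for row in x]


def conv_path(x: Matrix, kernel=(0.25, 0.5, 0.25)) -> Matrix:
    """Hypothetical local path: depthwise 1-D convolution over tokens, zero padding."""
    n, c, pad = len(x), len(x[0]), len(kernel) // 2
    out = []
    for i in range(n):
        row = []
        for ch in range(c):
            s = 0.0
            for j, w in enumerate(kernel):
                t = i + j - pad
                if 0 <= t < n:  # zero padding outside the sequence
                    s += w * x[t][ch]
            row.append(s)
        out.append(row)
    return out


def softmax(v: List[float]) -> List[float]:
    m = max(v)  # subtract the max for numerical stability
    e = [math.exp(a - m) for a in v]
    z = sum(e)
    return [a / z for a in e]


def attention_path(x: Matrix) -> Matrix:
    """Hypothetical global path: single-head self-attention with identity projections."""
    d = len(x[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in x:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in x])
        out.append([sum(weights[t] * x[t][ch] for t in range(len(x))) for ch in range(d)])
    return out


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fusion_weight(conv_out: Matrix, attn_out: Matrix,
                  gate_w: float = 1.0, gate_b: float = 0.0) -> float:
    """Scalar weight computed from the features: global average pool, then a sigmoid gate."""
    total = sum(sum(r) for r in conv_out) + sum(sum(r) for r in attn_out)
    count = len(conv_out) * (len(conv_out[0]) + len(attn_out[0]))
    return sigmoid(gate_w * (total / count) + gate_b)


def asf_block(x: Matrix, gate_w: float = 1.0, gate_b: float = 0.0) -> Tuple[Matrix, float]:
    """Split channels, run both paths, weight them by alpha / (1 - alpha), and concatenate."""
    xc, xa = split_channels(x)
    yc, ya = conv_path(xc), attention_path(xa)
    alpha = fusion_weight(yc, ya, gate_w, gate_b)
    fused = [[alpha * a for a in rc] + [(1.0 - alpha) * b for b in ra]
             for rc, ra in zip(yc, ya)]
    return fused, alpha
```

For example, `asf_block([[1.0, 0.0, 2.0, 0.0], [0.0, 1.0, 0.0, 2.0], [1.0, 1.0, 1.0, 1.0]])` returns a 3 x 4 output and a weight strictly between 0 and 1. Because α is derived from the input, the balance between the local and global halves changes per sample rather than being fixed, which is the property the abstract emphasizes.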