The recent emergence of hybrid models has introduced another transformative approach to computer vision, gradually shifting the field away from pure CNNs (Convolutional Neural Networks) and ViTs (Vision Transformers). However, little effort has been devoted to combining these two approaches efficiently so as to better capture the long-range dependencies prevalent in complex images. In this paper, we introduce iiANET (Inception-Inspired Attention Network), an efficient hybrid model designed to capture long-range dependencies in complex images. Its fundamental building block, iiABlock, integrates in parallel a global 2D-MHSA (Multi-Head Self-Attention) with registers, MBConv2 (MobileNetV2-based convolution), and dilated convolution, enabling the model to leverage self-attention for capturing long-range dependencies, MBConv2 for effective local-detail extraction, and dilated convolution for efficiently expanding the receptive field to capture more contextual information. Finally, we serially append an ECANet (Efficient Channel Attention Network) at the end of each iiABlock to recalibrate channel-wise attention and enhance model performance. Extensive qualitative and quantitative evaluation on several benchmarks demonstrates improved performance over a number of state-of-the-art models.
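The abstract gives no implementation details, so the following is only a minimal NumPy sketch of the fusion pattern it describes: a global self-attention branch with prepended register tokens, a local convolution branch (a plain depthwise 3×3 convolution here, as a stand-in for MBConv2), and a dilated convolution branch, combined in parallel and followed serially by an ECA-style channel gate. The identity Q/K/V projections, the summation fusion, and all shapes are illustrative assumptions, not the paper's actual design.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, n_registers=4, rng=None):
    """Global self-attention over flattened spatial tokens with register
    tokens prepended (randomly initialized here; learned in practice)."""
    rng = np.random.default_rng(0) if rng is None else rng
    n, c = tokens.shape
    registers = rng.standard_normal((n_registers, c)) * 0.02
    x = np.concatenate([registers, tokens], axis=0)
    q = k = v = x  # identity Q/K/V projections for brevity (assumption)
    attn = softmax(q @ k.T / np.sqrt(c))
    return (attn @ v)[n_registers:]  # drop register tokens before fusion

def conv2d(x, kernel, dilation=1):
    """Naive 'same' depthwise 2D convolution; x: (C, H, W), kernel: (kh, kw)."""
    C, H, W = x.shape
    kh, kw = kernel.shape
    ph, pw = dilation * (kh // 2), dilation * (kw // 2)
    xp = np.pad(x, ((0, 0), (ph, ph), (pw, pw)))
    out = np.zeros_like(x)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * xp[:, i * dilation:i * dilation + H,
                                        j * dilation:j * dilation + W]
    return out

def eca(x, k=3):
    """ECA-style channel attention: global average pool, 1D conv across
    channels, sigmoid gate applied channel-wise."""
    pooled = x.mean(axis=(1, 2))                      # (C,)
    w = np.convolve(pooled, np.ones(k) / k, mode="same")
    gate = 1.0 / (1.0 + np.exp(-w))
    return x * gate[:, None, None]

def iia_block(x):
    """Sketch of an iiABlock: three parallel branches fused (here by
    summation, an assumption), then a serial ECA gate."""
    C, H, W = x.shape
    rng = np.random.default_rng(0)
    kernel = rng.standard_normal((3, 3)) * 0.1
    # Branch 1: global self-attention with registers over flattened tokens.
    tokens = x.reshape(C, H * W).T                    # (HW, C)
    attn_out = self_attention(tokens, rng=rng).T.reshape(C, H, W)
    # Branch 2: local detail (stand-in for MBConv2).
    local = conv2d(x, kernel, dilation=1)
    # Branch 3: dilated convolution for a wider receptive field.
    context = conv2d(x, kernel, dilation=2)
    fused = attn_out + local + context
    return eca(fused)
```

A single forward pass on a small `(C, H, W)` feature map preserves the input shape, matching the abstract's description of iiABlock as a drop-in building block.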