ELA: Efficient Local Attention for Deep Convolutional Neural Networks

Attention mechanisms are widely used in computer vision because they can improve the performance of deep neural networks. However, existing methods often fail to exploit spatial information effectively, or do so only at the cost of reducing channel dimensions or increasing network complexity. To address these limitations, this paper introduces Efficient Local Attention (ELA), a method that achieves substantial performance improvements with a simple structure. By analyzing the Coordinate Attention method, we identify three limitations: the limited generalization ability of Batch Normalization, the adverse effect of dimension reduction on channel attention, and the complexity of the attention generation process. To overcome them, we propose combining 1D convolution with Group Normalization as feature enhancement techniques. This approach accurately localizes regions of interest by efficiently encoding two 1D positional feature maps without dimension reduction, while remaining lightweight. We carefully design three hyperparameters in ELA, yielding four versions (ELA-T, ELA-B, ELA-S, and ELA-L) that cater to the requirements of different visual tasks such as image classification, object detection, and semantic segmentation. ELA can be seamlessly integrated into deep CNNs such as ResNet, MobileNet, and DeepLab. Extensive evaluations on the ImageNet, MSCOCO, and Pascal VOC datasets demonstrate that the proposed ELA module outperforms current state-of-the-art methods on all three tasks.
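The mechanism described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract states only that two 1D positional feature maps are encoded with 1D convolution and Group Normalization without dimension reduction. Everything else here is an assumption made for illustration: the use of average pooling along each spatial axis, a depthwise convolution with kernel size 7, weights shared between the two directions, a Group Normalization without learnable affine parameters, a sigmoid gate, and the helper names `depthwise_conv1d`, `group_norm`, and `ela`.

```python
import numpy as np


def depthwise_conv1d(x, weight):
    """Per-channel 1D cross-correlation with zero 'same' padding.

    x: (B, C, L) array; weight: (C, K) array with odd K (one filter per channel).
    """
    length = x.shape[2]
    k = weight.shape[1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (0, 0), (pad, pad)))
    out = np.zeros(x.shape, dtype=float)
    for j in range(k):
        # weight[None, :, j, None] has shape (1, C, 1) and broadcasts over B and L.
        out += weight[None, :, j, None] * xp[:, :, j:j + length]
    return out


def group_norm(x, num_groups, eps=1e-5):
    """Group Normalization over (B, C, L) without affine parameters (assumption)."""
    b, c, length = x.shape
    g = x.reshape(b, num_groups, -1)
    mean = g.mean(axis=2, keepdims=True)
    var = g.var(axis=2, keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(b, c, length)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def ela(x, weight, num_groups):
    """Sketch of an ELA-style block for x of shape (B, C, H, W).

    The channel count C is kept throughout, so there is no dimension reduction.
    """
    x_h = x.mean(axis=3)  # (B, C, H): pooled along the width
    x_w = x.mean(axis=2)  # (B, C, W): pooled along the height
    # Assumption: the same convolution weights serve both directions.
    a_h = sigmoid(group_norm(depthwise_conv1d(x_h, weight), num_groups))
    a_w = sigmoid(group_norm(depthwise_conv1d(x_w, weight), num_groups))
    # Reweight every position by its row and column attention values.
    return x * a_h[:, :, :, None] * a_w[:, :, None, :]


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 32, 8, 10))       # hypothetical feature map
weight = rng.standard_normal((32, 7)) * 0.1   # hypothetical depthwise filters
y = ela(x, weight, num_groups=16)
print(y.shape)  # (2, 32, 8, 10)
```

Because both attention maps pass through a sigmoid, each output value is the input scaled by a factor between 0 and 1, and the output keeps the input's shape, which is what would allow such a block to be dropped into an existing CNN stage.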
