Channel and spatial attention mechanisms introduced in earlier work enhance the representational capabilities of deep convolutional neural networks (CNNs), but often at the cost of additional parameters and computation. Whereas recent approaches focus solely on efficient feature-context modeling for channel attention, we aim to model both channel and spatial attention comprehensively with few parameters and low computational cost. Leveraging the principles of relational modeling in graphs, we introduce a constant-parameter module, \textit{STEAM: Squeeze and Transform Enhanced Attention Module}, which integrates channel and spatial attention to enhance the representation power of CNNs. To our knowledge, we are the first to propose a graph-based approach for modeling both channel and spatial attention, utilizing concepts from multi-head graph transformers. Additionally, we introduce \textit{Output Guided Pooling} (OGP), which efficiently captures spatial context to further enhance spatial attention. We extensively evaluate STEAM on large-scale image classification, object detection, and instance segmentation using standard benchmark datasets. STEAM achieves a \(2\%\) increase in accuracy over the standard ResNet-50 model with only a marginal increase in GFLOPs. Furthermore, STEAM outperforms the leading modules, ECA and GCT, in accuracy while achieving a threefold reduction in GFLOPs. The code will be made available upon acceptance.
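To make the phrase "constant-parameter" concrete, the sketch below shows one way a graph-style channel attention can have a parameter count that does not depend on the number of channels. This is a hypothetical illustration, not the STEAM module. It assumes that each channel is a node on a cyclic (ring) graph, that each node is summarized by global average pooling, and that all nodes share one small set of neighbor weights (the illustrative `weights` argument of the hypothetical function `ring_channel_attention`). Under these assumptions, the module has `k` parameters whether the feature map has 64 or 2048 channels.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def ring_channel_attention(x, weights):
    """Hypothetical constant-parameter channel attention (illustration only).

    x: feature map of shape (C, H, W).
    weights: shared neighbor weights of shape (k,), k odd; weights[j] is
             applied to the neighbor at offset j - k // 2 on the ring.
    The number of learnable parameters is k, independent of C.
    """
    c = x.shape[0]
    k = weights.shape[0]
    half = k // 2
    # Squeeze: one scalar descriptor per channel (graph node).
    node = x.mean(axis=(1, 2))  # shape (C,)
    # Aggregate each node's k ring neighbors with the shared weights:
    # np.roll(node, -offset)[i] == node[(i + offset) % C].
    agg = np.zeros(c)
    for offset, w in zip(range(-half, half + 1), weights):
        agg += w * np.roll(node, -offset)
    # Excite: rescale each channel by a gate in (0, 1).
    gate = sigmoid(agg)
    return x * gate[:, None, None]
```

For example, `ring_channel_attention(np.ones((4, 2, 2)), np.zeros(3))` gates every channel by 0.5. Because the weights are shared around the ring, the same three weights serve any channel count. STEAM's actual graph construction, multi-head transform, spatial branch, and Output Guided Pooling are not reproduced here.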