Channel and spatial attention mechanisms have proven remarkably effective at producing more discernible feature representations across a range of computer vision tasks. However, modeling cross-channel relationships through channel dimensionality reduction may introduce side effects when extracting deep visual representations. In this paper, we propose a novel efficient multi-scale attention (EMA) module. To retain per-channel information while reducing computational overhead, we reshape part of the channels into the batch dimension and group the channel dimension into multiple sub-features, so that spatial semantic features are well distributed within each feature group. Specifically, in addition to encoding global information to recalibrate the channel-wise weights in each parallel branch, the output features of the two parallel branches are further aggregated through cross-dimension interaction to capture pixel-level pairwise relationships. We conduct extensive ablation studies and experiments on image classification and object detection tasks with popular benchmarks (e.g., CIFAR-100, ImageNet-1k, MS COCO, and VisDrone2019) to evaluate its performance.
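The channel-to-batch regrouping described in the abstract can be illustrated with a minimal sketch. It assumes the regrouping is equivalent to a row-major reshape from (B, C, H, W) to (B·G, C/G, H, W), which the abstract does not state explicitly; the function names, the `groups` parameter, and the nested-list layout are hypothetical and chosen only for illustration. Each channel's spatial map is treated as an opaque value, so the sketch shows only how channels are partitioned, not the attention computation itself.

```python
def fold_groups_into_batch(x, groups):
    """Split each sample's channels into `groups` contiguous chunks and
    stack the chunks along the batch axis.

    x is a nested list indexed as x[b][c]; each x[b][c] stands for one
    channel's spatial map and is left untouched.
    Returns a list of length B * groups whose entries hold C // groups channels.
    (Assumed layout, not taken from the paper.)
    """
    if not x:
        return []
    channels = len(x[0])
    if channels % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    per_group = channels // groups
    folded = []
    for sample in x:
        for g in range(groups):
            # Contiguous slice of channels belonging to group g of this sample.
            folded.append(sample[g * per_group:(g + 1) * per_group])
    return folded


def unfold_groups_from_batch(folded, groups):
    """Inverse of fold_groups_into_batch: restore the original x[b][c] layout."""
    restored = []
    for start in range(0, len(folded), groups):
        sample = []
        for chunk in folded[start:start + groups]:
            sample.extend(chunk)
        restored.append(sample)
    return restored
```

Because no channel is discarded or projected to a lower dimension, unfolding the grouped tensor recovers the input exactly, which is one way to read the abstract's claim that per-channel information is retained.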