Salient Object Detection (SOD) has traditionally relied on feature refinement modules that utilize the features of an ImageNet pre-trained backbone. However, this approach limits the possibility of pre-training the entire network because of the distinct nature of SOD and image classification. Additionally, the architecture of these backbones, originally built for image classification, is sub-optimal for a dense prediction task like SOD. To address these issues, we propose a novel encoder-decoder-style neural network called SODAWideNet++, designed explicitly for SOD. Inspired by the vision transformers' ability to attain a global receptive field from the initial stages, we introduce the Attention Guided Long Range Feature Extraction (AGLRFE) module, which combines large dilated convolutions and self-attention. Specifically, we use attention features to guide long-range information extracted by multiple dilated convolutions, thus taking advantage of the inductive biases of a convolution operation and the input dependency brought by self-attention. In contrast to the current paradigm of ImageNet pre-training, we modify 118K annotated images from the COCO semantic segmentation dataset by binarizing the annotations to pre-train the proposed model end-to-end. Further, we supervise the background predictions along with the foreground to push our model to generate accurate saliency predictions. SODAWideNet++ performs competitively on five different datasets while containing only 35% of the trainable parameters of state-of-the-art models. The code and pre-computed saliency maps are provided at https://github.com/VimsLab/SODAWideNetPlusPlus.
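The core idea of AGLRFE, attention features gating long-range features from parallel dilated convolutions, can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation; the module name, dilation rates, fusion layers, and sigmoid gating below are illustrative assumptions, and the actual design is specified in the paper and repository.

```python
import torch
import torch.nn as nn


class AGLRFESketch(nn.Module):
    """Illustrative sketch (not the published architecture) of attention-guided
    long-range feature extraction: parallel dilated convolutions supply a large
    receptive field, and a self-attention map gates (guides) their output."""

    def __init__(self, channels: int, dilations=(1, 3, 6)):
        super().__init__()
        # Parallel 3x3 convolutions with increasing dilation enlarge the
        # receptive field without extra parameters per branch.
        self.dilated = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
            for d in dilations
        )
        self.fuse = nn.Conv2d(channels * len(dilations), channels, 1)
        # Lightweight self-attention over flattened spatial positions,
        # providing the input-dependent guidance signal.
        self.attn = nn.MultiheadAttention(channels, num_heads=1, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Long-range convolutional features from multiple dilation rates.
        long_range = self.fuse(torch.cat([conv(x) for conv in self.dilated], dim=1))
        # Self-attention features: treat each spatial location as a token.
        tokens = x.flatten(2).transpose(1, 2)            # (b, h*w, c)
        attn_out, _ = self.attn(tokens, tokens, tokens)  # (b, h*w, c)
        guide = attn_out.transpose(1, 2).reshape(b, c, h, w)
        # Attention features gate the dilated-convolution features.
        return long_range * torch.sigmoid(guide)
```

The multiplicative sigmoid gate is one simple way to let the input-dependent attention map modulate the convolutional long-range features; the paper's module may combine the two streams differently.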