Developing a new Salient Object Detection (SOD) model typically involves selecting an ImageNet pre-trained backbone and designing novel feature refinement modules that exploit the backbone features. However, adding new components to a pre-trained backbone requires retraining the whole network on the ImageNet dataset, which takes significant time. Hence, we explore training a neural network from scratch directly on SOD, without ImageNet pre-training. Such a formulation offers full autonomy to design task-specific components. To that end, we propose SODAWideNet, an encoder-decoder-style network for Salient Object Detection. We deviate from the commonly practiced paradigm of narrow and deep convolutional models and adopt a wide and shallow architecture, resulting in a parameter-efficient deep neural network. To achieve a shallower network, we increase the receptive field from the beginning of the network using a combination of dilated convolutions and self-attention. Specifically, we propose the Multi Receptive Field Feature Aggregation Module (MRFFAM), which uses dilated convolutions to efficiently obtain discriminative features from farther regions at higher resolutions. Next, we propose Multi-Scale Attention (MSA), which creates a feature pyramid and efficiently computes attention across multiple resolutions to extract global features from larger feature maps. Finally, we propose two variants, SODAWideNet-S (3.03M parameters) and SODAWideNet (9.03M parameters), that achieve competitive performance against state-of-the-art models on five datasets.
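The two efficiency arguments in the abstract can be made concrete with a little arithmetic. The sketch below is not the authors' implementation: the kernel sizes, dilation rates, and feature-map sizes are hypothetical values chosen for illustration, and the helper names are our own. It shows (a) how dilation enlarges the receptive field of a stack of stride-1 convolutions without adding layers or weights, and (b) why computing self-attention on downsampled levels of a feature pyramid is far cheaper than on the full-resolution map.

```python
def effective_kernel(kernel_size: int, dilation: int) -> int:
    """Spatial extent covered by a single dilated convolution kernel."""
    return dilation * (kernel_size - 1) + 1


def receptive_field(layers) -> int:
    """Receptive field of a stack of stride-1 convolutions.

    `layers` is a list of (kernel_size, dilation) pairs.
    """
    rf = 1
    for kernel_size, dilation in layers:
        rf += effective_kernel(kernel_size, dilation) - 1
    return rf


def attention_pairs(height: int, width: int) -> int:
    """Query-key pairs in full self-attention over an H x W feature map."""
    tokens = height * width
    return tokens * tokens


# Hypothetical settings (not taken from the paper).
plain = receptive_field([(3, 1), (3, 1), (3, 1)])    # three ordinary 3x3 convs
dilated = receptive_field([(3, 1), (3, 2), (3, 4)])  # same depth, growing dilation
print(plain, dilated)  # 7 15

# Attention cost on a hypothetical 64x64 map and two downsampled pyramid levels.
for scale in (1, 2, 4):
    side = 64 // scale
    print(f"{side}x{side}: {attention_pairs(side, side)} query-key pairs")
```

With the same number of layers and weights, the dilated stack sees a 15-pixel window instead of 7. On the attention side, halving each spatial dimension reduces the number of query-key pairs by a factor of 16, which is the general reason a pyramid makes attention on large feature maps affordable.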