Most existing salient object detection methods use a U-Net or feature pyramid structure, which simply aggregates feature maps of different scales while ignoring their uniqueness, their interdependence, and their respective contributions to the final prediction. To overcome these limitations, we propose M$^3$Net, i.e., the Multilevel, Mixed and Multistage attention network for Salient Object Detection (SOD). First, we propose the Multiscale Interaction Block, which introduces cross-attention to achieve interaction between multilevel features, allowing high-level features to guide low-level feature learning and thus enhancing salient regions. Second, considering that previous Transformer-based SOD methods locate salient regions using only global self-attention and therefore inevitably overlook the details of complex objects, we propose the Mixed Attention Block. This block combines global self-attention and window self-attention to model context at both the global and local levels, further improving the accuracy of the prediction map. Finally, we propose a multilevel supervision strategy to optimize the aggregated features stage by stage. Experiments on six challenging datasets demonstrate that the proposed M$^3$Net surpasses recent CNN- and Transformer-based SOD methods in terms of four metrics. Code is available at https://github.com/I2-Multimedia-Lab/M3Net.
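The two attention ideas named in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the released M$^3$Net implementation: the function names (`cross_attention`, `mixed_attention`, and the helpers), the absence of learned projections, the 1-D windows over a flat token list, and the summation used to combine the global and window branches are all assumptions made purely for illustration. Low-level tokens act as queries and high-level tokens as keys and values, so high-level features guide the low-level ones.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of token vectors."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def cross_attention(low_tokens, high_tokens):
    """Hypothetical multilevel interaction: low-level queries attend to
    high-level keys/values, with a residual connection (an assumption)."""
    attended = attention(low_tokens, high_tokens, high_tokens)
    return [[x + a for x, a in zip(tok, att)]
            for tok, att in zip(low_tokens, attended)]

def global_self_attention(tokens):
    """Every token attends to every other token."""
    return attention(tokens, tokens, tokens)

def window_self_attention(tokens, window):
    """Tokens attend only within non-overlapping 1-D windows
    (a simplification of 2-D windows over a feature map)."""
    out = []
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        out.extend(attention(chunk, chunk, chunk))
    return out

def mixed_attention(tokens, window):
    """Hypothetical mix: residual sum of the global and window branches.
    How the paper actually combines them is not stated in the abstract."""
    g = global_self_attention(tokens)
    w = window_self_attention(tokens, window)
    return [[t + a + b for t, a, b in zip(tok, gt, wt)]
            for tok, gt, wt in zip(tokens, g, w)]

# Toy example: 4 low-level tokens and 2 high-level tokens, each of dimension 2.
low = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
high = [[1.0, 1.0], [0.0, 2.0]]
fused = cross_attention(low, high)
mixed = mixed_attention(fused, window=2)
print(len(mixed), len(mixed[0]))
```

In this sketch the output keeps the low-level token count and dimension, so the two blocks can be stacked; a practical version would add learned query/key/value projections, multiple heads, and normalization.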