RGB-D salient object detection (SOD), which aims to highlight prominent regions of a scene by jointly modeling RGB and depth information, is a challenging pixel-level prediction task. Recently, dual-attention mechanisms have been applied to this area because of their ability to strengthen the detection process. However, most existing methods directly fuse attentional cross-modality features under a manual-mandatory fusion paradigm, without accounting for the inherent discrepancy between RGB and depth, which may degrade performance. Moreover, the long-range dependencies arising from both global and local information make it difficult to design a unified, efficient fusion strategy. Hence, in this paper, we propose GL-DMNet, a novel dual mutual-learning network with global-local awareness. Specifically, we present a position mutual fusion module and a channel mutual fusion module to exploit the interdependencies between the two modalities in the spatial and channel dimensions. In addition, we adopt an efficient decoder based on cascade transformer-infused reconstruction to jointly integrate multi-level fusion features. Extensive experiments on six benchmark datasets demonstrate that GL-DMNet outperforms 24 RGB-D SOD methods, achieving an average improvement of ~3% across four evaluation metrics over the second-best model (S3Net). Code and results are available at https://github.com/kingkung2016/GL-DMNet.
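To make the cross-modal attention idea behind the position and channel mutual fusion modules concrete, the following is a minimal NumPy sketch of dual attention across two modalities. It is an illustration under stated assumptions, not the paper's exact design: the function names, the use of raw features as queries/keys/values (rather than learned projections), and the residual fusion are all assumptions introduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def position_mutual_fusion(f_rgb, f_dep):
    """Cross-modal spatial (position) attention sketch:
    each RGB location attends over all depth locations,
    capturing long-range spatial dependencies across modalities."""
    c, h, w = f_rgb.shape
    r = f_rgb.reshape(c, h * w)           # (C, N), N = H*W
    d = f_dep.reshape(c, h * w)
    attn = softmax(r.T @ d, axis=-1)      # (N, N): RGB queries, depth keys
    fused = d @ attn.T                    # depth values aggregated per RGB location
    return f_rgb + fused.reshape(c, h, w) # residual fusion (assumed)

def channel_mutual_fusion(f_rgb, f_dep):
    """Cross-modal channel attention sketch: channel-to-channel
    affinities between modalities reweight depth channels for RGB."""
    c, h, w = f_rgb.shape
    r = f_rgb.reshape(c, h * w)
    d = f_dep.reshape(c, h * w)
    attn = softmax(r @ d.T, axis=-1)      # (C, C) cross-modal channel affinity
    fused = attn @ d                      # (C, N)
    return f_rgb + fused.reshape(c, h, w) # residual fusion (assumed)
```

In a real network these operations would act on backbone feature maps, with learned query/key/value projections and a learnable fusion weight; the sketch only shows how spatial and channel interdependencies between RGB and depth can each be modeled with an attention map of a different shape (N×N vs. C×C).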