RGB-D salient object detection (SOD) aims to detect prominent regions by jointly modeling RGB and depth information. Most RGB-D SOD methods apply the same type of backbone and fusion module to all modalities and stages, learning the multimodal and multistage features in an identical way. However, these features contribute differently to the final saliency results, which raises two issues: 1) how to model the discrepant characteristics of RGB images and depth maps; 2) how to fuse these cross-modality features at different stages. In this paper, we propose a high-order discrepant interaction network (HODINet) for RGB-D SOD. Concretely, we first employ transformer-based and CNN-based architectures as backbones to encode RGB and depth features, respectively. Then, high-order representations are extracted and embedded into spatial and channel attention for cross-modality feature fusion at different stages. Specifically, we design a high-order spatial fusion (HOSF) module to fuse features of the first two stages and a high-order channel fusion (HOCF) module to fuse features of the last two stages. In addition, a cascaded pyramid reconstruction network is adopted to progressively decode the fused features in a top-down pathway. Extensive experiments on seven widely used datasets demonstrate the effectiveness of the proposed approach. We achieve competitive performance against 24 state-of-the-art methods under four evaluation metrics.