Most existing RGB-D salient object detection (SOD) methods follow a CNN-based paradigm, which cannot model long-range dependencies across space and modalities because of the inherent locality of convolutions. To address this problem, we propose the Hierarchical Cross-modal Transformer (HCT), a new multi-modal transformer. Unlike previous multi-modal transformers that directly connect all patches from the two modalities, HCT explores cross-modal complementarity hierarchically, respecting the modality gap and the spatial discrepancy in unaligned regions. Specifically, we use intra-modal self-attention to explore complementary global contexts, and compute spatially aligned inter-modal attention locally to capture cross-modal correlations. In addition, we present a Feature Pyramid module for Transformer (FPT) to boost informative cross-scale integration, as well as a consistency-complementarity module to disentangle the multi-modal integration path and improve fusion adaptivity. Comprehensive experiments on a wide range of public datasets verify the efficacy of our designs and show consistent improvement over state-of-the-art models.
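The two-level attention described in the abstract can be illustrated with a minimal NumPy sketch. It is not the HCT implementation: the function names, the identity query/key/value projections, the single attention head, and the choice to restrict inter-modal attention to the token pair at the same patch position are all assumptions made here for illustration. The abstract only states that intra-modal attention is global and inter-modal attention is local and spatially aligned.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def intra_modal_attention(tokens):
    # Global self-attention within one modality: every patch token
    # attends to all patch tokens of the same modality.
    # Assumption: identity Q/K/V projections, one head.
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)        # (N, N)
    return softmax(scores, axis=-1) @ tokens       # (N, d)

def aligned_inter_modal_attention(rgb, depth):
    # Local cross-modal attention: the RGB token at position n attends
    # only to the RGB and depth tokens at that same position n
    # (hypothetical reading of "spatially aligned, local").
    d = rgb.shape[-1]
    pair = np.stack([rgb, depth], axis=1)          # (N, 2, d)
    scores = np.einsum("nd,nkd->nk", rgb, pair) / np.sqrt(d)  # (N, 2)
    weights = softmax(scores, axis=-1)
    fused = np.einsum("nk,nkd->nd", weights, pair) # (N, d)
    return fused, weights

def hierarchical_fusion(rgb, depth):
    # Level 1: global context inside each modality.
    rgb_ctx = intra_modal_attention(rgb)
    depth_ctx = intra_modal_attention(depth)
    # Level 2: cross-modal mixing only at aligned positions.
    return aligned_inter_modal_attention(rgb_ctx, depth_ctx)

rng = np.random.default_rng(0)
num_patches, dim = 16, 8                           # illustrative sizes
rgb_tokens = rng.standard_normal((num_patches, dim))
depth_tokens = rng.standard_normal((num_patches, dim))
fused, weights = hierarchical_fusion(rgb_tokens, depth_tokens)
print(fused.shape, weights.shape)                  # (16, 8) (16, 2)
```

Under these assumptions, joint attention over the concatenation of both modalities would score (2N)^2 token pairs, whereas the sketch scores 2N^2 pairs within modalities plus 2N aligned cross-modal pairs, and it never lets a token attend to a spatially misaligned token of the other modality.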