Most existing bi-modal (RGB-D and RGB-T) salient object detection (SOD) methods rely on the convolution operation and build complex interwoven fusion structures to integrate cross-modal information. The inherent local connectivity of convolution imposes a performance ceiling on convolution-based methods. In this work, we rethink these tasks from the perspective of global information alignment and transformation. Specifically, the proposed \underline{c}ross-mod\underline{a}l \underline{v}iew-mixed transform\underline{er} (CAVER) cascades several cross-modal integration units to construct a top-down transformer-based information propagation path. CAVER treats multi-scale and multi-modal feature integration as a sequence-to-sequence context propagation and update process built on a novel view-mixed attention mechanism. In addition, because the complexity of attention is quadratic in the number of input tokens, we design a parameter-free patch-wise token re-embedding strategy to simplify operations. Extensive experimental results on RGB-D and RGB-T SOD datasets demonstrate that such a simple two-stream encoder-decoder framework can surpass recent state-of-the-art methods when equipped with the proposed components. Code and pretrained models will be available at \href{https://github.com/lartpang/CAVER}{the link}.
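To make the quadratic-cost argument concrete, the following is a minimal sketch of one way a parameter-free patch-wise token re-embedding could shrink an attention map. It is not taken from the CAVER code: the function names `patchwise_reembed` and `attention_pairs`, the choice to merge tokens by concatenating their features, and the decision to re-embed only the key side are all assumptions made for illustration.

```python
from typing import List

Token = List[float]


def patchwise_reembed(tokens: List[Token], height: int, width: int, patch: int) -> List[Token]:
    """Group each patch x patch window of a row-major token grid into one token.

    Hypothetical, parameter-free re-embedding: the features of the tokens in a
    window are concatenated, so no learned weights are involved.
    """
    if len(tokens) != height * width:
        raise ValueError("tokens must contain height * width entries")
    if height % patch or width % patch:
        raise ValueError("height and width must be divisible by patch")

    merged: List[Token] = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            window: Token = []
            # Walk the window row by row and append each token's features.
            for dy in range(patch):
                for dx in range(patch):
                    window.extend(tokens[(top + dy) * width + (left + dx)])
            merged.append(window)
    return merged


def attention_pairs(num_queries: int, num_keys: int) -> int:
    """Number of query-key similarity scores an attention map holds."""
    return num_queries * num_keys


# Toy 4x4 feature map with 2-dimensional tokens (illustrative values only).
height, width, patch = 4, 4, 2
tokens = [[float(i), float(-i)] for i in range(height * width)]
keys = patchwise_reembed(tokens, height, width, patch)

full_cost = attention_pairs(len(tokens), len(tokens))
reduced_cost = attention_pairs(len(tokens), len(keys))
print(len(keys), len(keys[0]), full_cost, reduced_cost)  # prints: 4 8 256 64
```

In this toy setting, merging 2x2 patches reduces the number of key tokens from 16 to 4, so the attention map shrinks from 256 to 64 scores, a factor of `patch ** 2`. Concatenation also widens each merged token from 2 to 8 features; how the actual method reconciles token dimensions, and which tokens it re-embeds, is not specified in the abstract.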