The emergence of different sensors (Near-Infrared, Depth, etc.) remedies the limited application scenarios of traditional RGB cameras. RGB-X tasks, which rely on an RGB input plus another type of data input to solve specific problems, have become a popular research topic in multimedia. A crucial part of two-branch RGB-X deep neural networks is how to fuse information across modalities. Given the large amount of information inside RGB-X networks, previous works typically apply naive fusion (e.g., average or max fusion) or focus only on feature fusion at the same scale(s). In this paper, we propose a novel method called RXFOOD that fuses features across different scales within the same modality branch and across different modality branches simultaneously in a unified attention mechanism. An Energy Exchange Module is designed for the interaction of each feature map's energy matrix, which reflects the inter-relationships among different positions and different channels inside a feature map. RXFOOD can be easily incorporated into any dual-branch encoder-decoder network as a plug-in module, and it helps the original backbone network better focus on important positions and channels for object-of-interest detection. Experimental results on RGB-NIR salient object detection, RGB-D salient object detection, and RGB-Frequency image manipulation detection demonstrate the effectiveness of the proposed RXFOOD.
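To make two terms from the abstract concrete, the following is a minimal sketch, not the RXFOOD implementation. It shows the naive baselines (average and max fusion) and one plausible reading of an "energy matrix". The abstract does not give a formula, so the Gram-matrix definition used here is an assumption, as are all function names and the toy feature values. The Energy Exchange Module itself is not reproduced.

```python
def naive_fusion(rgb_feat, x_feat, mode="average"):
    """Element-wise fusion of two same-shaped C x N feature maps
    (C channels, N = H*W flattened spatial positions)."""
    if mode == "average":
        combine = lambda a, b: (a + b) / 2.0
    elif mode == "max":
        combine = max
    else:
        raise ValueError(f"unknown fusion mode: {mode}")
    return [[combine(a, b) for a, b in zip(r_row, x_row)]
            for r_row, x_row in zip(rgb_feat, x_feat)]


def channel_energy(feat):
    """ASSUMED definition: C x C Gram matrix, where entry (i, j) is the
    dot product of channel i and channel j over all positions."""
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in feat] for ci in feat]


def position_energy(feat):
    """ASSUMED definition: N x N Gram matrix, where entry (p, q) is the
    dot product of the channel vectors at positions p and q."""
    columns = list(zip(*feat))  # one channel vector per spatial position
    return [[sum(a * b for a, b in zip(pi, pj)) for pj in columns]
            for pi in columns]


# Hypothetical toy features: 2 channels, 3 flattened spatial positions.
rgb = [[1.0, 2.0, 0.0],
       [0.0, 1.0, 3.0]]
depth = [[3.0, 0.0, 2.0],
         [2.0, 1.0, 1.0]]

fused_avg = naive_fusion(rgb, depth, "average")
fused_max = naive_fusion(rgb, depth, "max")
rgb_channel_energy = channel_energy(rgb)
rgb_position_energy = position_energy(rgb)
```

Under this assumed definition, naive fusion combines the two branches position by position with no notion of which positions or channels matter, whereas the energy matrices summarize how channels (or positions) within one feature map relate to each other, which is the kind of information the abstract says is exchanged across scales and modalities.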