The rapid development of deep learning has significantly improved salient object detection that combines RGB and thermal images. However, existing deep learning-based models suffer from two major shortcomings. First, the computation and memory demands of Transformer-based models, whose attention has quadratic complexity, are prohibitive, especially for high-resolution bi-modal feature fusion. Second, even if training converges to an ideal solution, a frequency gap remains between the prediction and the ground truth. We therefore propose a purely fast Fourier transform-based model, the deep Fourier-embedded network (DFENet), for learning bi-modal information from RGB and thermal images. On the one hand, the fast Fourier transform captures global dependencies efficiently at low complexity. Inspired by this, we design a modal-coordinated perception attention that fuses RGB and thermal modalities across their frequency gap with multi-dimensional representation enhancement. To obtain reliable detailed information during decoding, we design a frequency-decomposed edge-aware module (FEM) that clarifies object edges by deeply decomposing low-level features. Moreover, we equip each decoder layer with the proposed Fourier residual channel attention block, which prioritizes high-frequency information while aligning global channel relationships. On the other hand, we propose a co-focus frequency loss (CFL) that steers the FEM toward minimizing the frequency gap. CFL dynamically weights hard frequencies during edge frequency reconstruction by cross-referencing bi-modal edge information in the Fourier domain. This frequency-level refinement of edge features further improves the quality of the final pixel-level prediction. Extensive experiments on four bi-modal salient object detection benchmark datasets demonstrate that DFENet outperforms twelve existing state-of-the-art models.
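To make the idea of dynamically weighting hard frequencies concrete, the following is a minimal, hypothetical sketch of a frequency-domain loss in the spirit described above (it is not the paper's actual CFL, which additionally cross-references bi-modal edge information): prediction and ground truth are compared after a 2-D FFT, and frequencies with larger error receive larger weight. The function name `frequency_gap_loss` and the `alpha` sharpness parameter are illustrative assumptions.

```python
import numpy as np

def frequency_gap_loss(pred, target, alpha=1.0):
    """Sketch of a dynamically weighted frequency loss (assumed form,
    not the paper's CFL). `pred` and `target` are 2-D arrays; `alpha`
    controls how strongly hard frequencies are emphasized."""
    fp = np.fft.fft2(pred)
    ft = np.fft.fft2(target)
    err = np.abs(fp - ft)                     # per-frequency magnitude gap
    w = (err / (err.max() + 1e-8)) ** alpha   # hard frequencies weigh more
    return float(np.mean(w * err ** 2))       # weighted spectral distance

# Identical inputs yield zero loss; any spectral mismatch is penalized,
# with the largest-error frequencies dominating the average.
```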