Salient Object Detection in RGB-D Videos

Given the widespread adoption of depth-sensing acquisition devices, RGB-D videos and related data/media have gained considerable traction in various aspects of daily life. Consequently, salient object detection (SOD) in RGB-D videos is a promising and evolving research direction. Despite this potential, SOD in RGB-D videos remains under-explored, because RGB-D SOD and video SOD (VSOD) have traditionally been studied in isolation. To explore this emerging field, this paper makes two primary contributions: a dataset and a model. First, we construct RDVS, a new RGB-D VSOD dataset with realistic depth, characterized by diverse scenes and rigorous frame-by-frame annotations. We validate the dataset through comprehensive attribute and object-oriented analyses, and provide training and testing splits. Second, we introduce DCTNet+, a three-stream network tailored for RGB-D VSOD, which emphasizes the RGB modality and treats depth and optical flow as auxiliary modalities. To achieve effective feature enhancement, refinement, and fusion for precise final prediction, we propose two modules: the multi-modal attention module (MAM) and the refinement fusion module (RFM). To enhance interaction and fusion within the RFM, we design a universal interaction module (UIM), and we integrate holistic multi-modal attentive paths (HMAPs) to refine multi-modal low-level features before they reach the RFMs. Comprehensive experiments on pseudo RGB-D video datasets and on our RDVS show that DCTNet+ outperforms 17 VSOD models and 14 RGB-D SOD models. Ablation experiments on both pseudo and realistic RGB-D video datasets demonstrate the advantages of the individual modules as well as the necessity of introducing realistic depth. Our code, together with the RDVS dataset, will be available at https://github.com/kerenfu/RDVS/.
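To make the idea of an RGB-dominant design with auxiliary modalities concrete, the following is a minimal illustrative sketch and not the actual MAM, RFM, or UIM of DCTNet+, whose internals are not specified in this abstract. The function name `fuse_rgb_dominant`, the sigmoid gating, and the residual form are all assumptions chosen for illustration: depth and optical-flow features only modulate the RGB features and never replace them.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse_rgb_dominant(rgb, depth, flow):
    """Hypothetical fusion: RGB is the main signal, depth/flow act as gates.

    Each argument is a flat list of feature values of equal length.
    This is an assumed, simplified stand-in, not the DCTNet+ modules.
    """
    if not (len(rgb) == len(depth) == len(flow)):
        raise ValueError("all three feature vectors must have the same length")

    fused = []
    for r, d, f in zip(rgb, depth, flow):
        # Average the two auxiliary attention values; the gate lies in (0, 1).
        gate = 0.5 * (sigmoid(d) + sigmoid(f))
        # Residual form: the RGB feature passes through unchanged and
        # receives an additional boost scaled by the auxiliary gate.
        fused.append(r + r * gate)
    return fused


if __name__ == "__main__":
    rgb_feat = [1.0, 2.0]
    depth_feat = [0.0, 0.0]
    flow_feat = [0.0, 0.0]
    print(fuse_rgb_dominant(rgb_feat, depth_feat, flow_feat))
```

In this sketch a zero RGB response stays zero regardless of the depth or flow values, which is one simple way to express that the auxiliary streams can strengthen RGB evidence but cannot create a detection on their own.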
