With the widespread adoption of depth-sensing acquisition devices, RGB-D videos and related data/media have gained considerable traction in various aspects of daily life. Consequently, salient object detection (SOD) in RGB-D videos is a highly promising and evolving research direction. Despite this potential, SOD in RGB-D videos remains under-explored, with RGB-D SOD and video SOD (VSOD) traditionally studied in isolation. To explore this emerging field, this paper makes two primary contributions: a dataset and a model. First, we construct RDVS, a new RGB-D VSOD dataset with realistic depth, characterized by diverse scenes and rigorous frame-by-frame annotations. We validate the dataset through comprehensive attribute and object-oriented analyses, and provide training and testing splits. Second, we introduce DCTNet+, a three-stream network tailored for RGB-D VSOD that emphasizes the RGB modality and treats depth and optical flow as auxiliary modalities. To achieve effective feature enhancement, refinement, and fusion for precise final prediction, we propose two modules: the multi-modal attention module (MAM) and the refinement fusion module (RFM). To enhance interaction and fusion within RFM, we design a universal interaction module (UIM) and then integrate holistic multi-modal attentive paths (HMAPs) to refine multi-modal low-level features before they reach the RFMs. Comprehensive experiments on pseudo RGB-D video datasets and on our RDVS show that DCTNet+ outperforms 17 VSOD models and 14 RGB-D SOD models. Ablation experiments on both pseudo and realistic RGB-D video datasets demonstrate the advantages of the individual modules as well as the necessity of introducing realistic depth. Our code, together with the RDVS dataset, will be available at https://github.com/kerenfu/RDVS/.