Glass surfaces, ubiquitous in both daily life and professional environments, present a potential threat to vision-based systems such as robot and drone navigation. To address this challenge, recent studies have shown significant interest in Video Glass Surface Detection (VGSD). We observe that objects in the reflection (or transmission) layer appear farther away than the glass surface itself. Consequently, in video motion scenarios, prominent reflected (or transmitted) objects seen on a glass surface move more slowly than objects in non-glass regions within the same spatial plane, and this motion inconsistency can effectively reveal the presence of glass surfaces. Based on this observation, we propose a novel network, named MVGD-Net, for detecting glass surfaces in videos by leveraging motion inconsistency cues. MVGD-Net features three novel modules: the Cross-scale Multimodal Fusion Module (CMFM), which integrates extracted spatial features and estimated optical flow maps, and the History Guided Attention Module (HGAM) and Temporal Cross Attention Module (TCAM), both of which further enhance temporal features. A Temporal-Spatial Decoder (TSD) is also introduced to fuse the spatial and temporal features and generate the glass region mask. Furthermore, to train our network, we introduce a large-scale dataset comprising 312 diverse glass scenarios with a total of 19,268 frames. Extensive experiments demonstrate that MVGD-Net outperforms relevant state-of-the-art methods.
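The motion inconsistency cue described above can be illustrated with a small geometric sketch. It is not part of MVGD-Net. It assumes an idealized pinhole camera that translates sideways in front of a static scene, so that the image shift of a point is roughly focal length times camera translation divided by depth. The function names, the focal length, and the distances are all hypothetical values chosen for illustration.

```python
def parallax_shift(focal_px: float, baseline_m: float, depth_m: float) -> float:
    """Approximate image shift (pixels) of a static point at depth_m when a
    pinhole camera translates sideways by baseline_m (assumed model)."""
    if depth_m <= 0:
        raise ValueError("depth_m must be positive")
    return focal_px * baseline_m / depth_m


def apparent_depth(glass_depth_m: float, object_offset_m: float) -> float:
    """Apparent depth of a reflected or transmitted object: the glass depth plus
    the object's distance from the glass plane (ideal planar glass assumed)."""
    return glass_depth_m + object_offset_m


# Hypothetical setup: glass 2 m from the camera, and an object 3 m from the glass
# whose reflection (or transmission) is visible on the glass surface.
focal_px = 800.0
baseline_m = 0.05
glass_depth_m = 2.0
object_offset_m = 3.0

# Shift of a non-glass point lying in the glass plane (e.g. the window frame).
frame_shift = parallax_shift(focal_px, baseline_m, glass_depth_m)

# Shift of the reflected or transmitted object seen on the glass.
layer_shift = parallax_shift(
    focal_px, baseline_m, apparent_depth(glass_depth_m, object_offset_m)
)

# A positive gap means the content seen on the glass moves more slowly.
inconsistency = frame_shift - layer_shift
print(f"frame: {frame_shift:.2f} px, glass content: {layer_shift:.2f} px, "
      f"gap: {inconsistency:.2f} px")
```

In a real video these per-region shifts would come from an estimated optical flow map rather than from known depths; the sketch only shows why a larger apparent depth yields slower apparent motion.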