Most existing RGB-D semantic segmentation methods focus on feature-level fusion, including complex cross-modality and cross-scale fusion modules. However, these methods may cause misalignment during feature fusion and produce counter-intuitive patches in the segmentation results. Inspired by the popular pixel-node-pixel pipeline, we propose to 1) fuse features from the two modalities in a late-fusion style, in which the injection of geometric features is guided by a texture feature prior; and 2) apply Graph Neural Networks (GNNs) to the fused features to suppress irregular patches by inferring the relationships between patches. At the 3D feature extraction stage, we argue that traditional CNNs are not efficient enough for depth maps. We therefore encode the depth map into a normal map, from which CNNs can easily extract object surface tendencies. At the projection matrix generation stage, we identify Biased-Assignment and Ambiguous-Locality issues in the original pipeline. We therefore propose to 1) adopt the Kullback-Leibler loss to ensure that no important pixel features are missed, which can be viewed as a hard pixel mining process; and 2) assign larger edge weights to connections between regions that are close to each other in both Euclidean space and semantic space, so that location information is taken into account. Extensive experiments on two public datasets, NYU-DepthV2 and SUN RGB-D, show that our approach consistently improves the performance of RGB-D semantic segmentation.
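Two of the steps named in the abstract, encoding a depth map as a normal map and weighting graph edges by joint spatial and semantic proximity, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names `depth_to_normals` and `region_affinity`, the finite-difference normal estimate that ignores camera intrinsics, the Gaussian spatial kernel with bandwidth `sigma`, and the clipped cosine similarity are all assumptions chosen for illustration.

```python
import numpy as np


def depth_to_normals(depth):
    """Approximate per-pixel surface normals from an (H, W) depth map.

    Assumption: image-space finite differences stand in for true 3D
    gradients, so camera intrinsics are ignored.
    """
    depth = np.asarray(depth, dtype=np.float64)
    # np.gradient returns the derivative along axis 0 (rows, y) first,
    # then along axis 1 (columns, x).
    dz_dy, dz_dx = np.gradient(depth)
    # An unnormalized normal of the surface z = f(x, y) is (-df/dx, -df/dy, 1).
    normals = np.stack([-dz_dx, -dz_dy, np.ones_like(depth)], axis=-1)
    # The z component is 1, so the norm is never zero.
    normals /= np.linalg.norm(normals, axis=-1, keepdims=True)
    return normals


def region_affinity(centroids, features, sigma=1.0):
    """Hypothetical edge weights between N regions.

    A weight is large only when two regions are close both in Euclidean
    space (centroids, shape (N, D)) and in semantic space (features,
    shape (N, C), assumed to have nonzero rows).
    """
    centroids = np.asarray(centroids, dtype=np.float64)
    features = np.asarray(features, dtype=np.float64)
    # Pairwise squared Euclidean distances between region centroids.
    diff = centroids[:, None, :] - centroids[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)
    spatial = np.exp(-sq_dist / (2.0 * sigma ** 2))
    # Cosine similarity between region features, clipped to [0, 1].
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    semantic = np.clip(unit @ unit.T, 0.0, 1.0)
    # The product is high only if both terms are high.
    return spatial * semantic
```

Under these assumptions, a flat depth map yields normals that all point along the z axis, and two regions with identical features receive a weight that decays with the distance between their centroids. The multiplicative combination is one simple way to require proximity in both spaces; the abstract does not specify the exact form used.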