Efficient RGB-D semantic segmentation has received considerable attention in mobile robotics, where it plays a vital role in analyzing and recognizing environmental information. Previous studies show that depth information provides geometric relationships for objects and scenes, but real depth data are usually noisy. To avoid adverse effects on segmentation accuracy and computational cost, an efficient framework is needed to exploit cross-modal correlations and complementary cues. In this paper, we propose an efficient, lightweight encoder-decoder network that reduces the number of parameters while maintaining robustness. Using channel and spatial fusion attention modules, our network effectively captures multi-level RGB-D features. A globally guided local affinity context module is proposed to obtain sufficient high-level context information. The decoder uses a lightweight residual unit that combines short- and long-distance information with few redundant computations. Experimental results on the NYUv2, SUN RGB-D, and Cityscapes datasets show that our method achieves a better trade-off among segmentation accuracy, inference time, and parameter count than state-of-the-art methods. The source code will be available at https://github.com/MVME-HBUT/SGACNet
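As a rough illustration of the channel and spatial attention fusion mentioned in the abstract, the sketch below reweights an RGB feature map and a depth feature map per channel and per pixel, then sums them. The abstract does not describe the internals of the modules, so everything here is an assumption: the function names (`channel_gate`, `spatial_gate`, `attend`, `fuse_rgbd`) are hypothetical, the gates are parameter-free (no learned weights), and feature maps are plain nested lists shaped `[channels][height][width]`. It is not the SGACNet implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_gate(feat):
    """One weight per channel: sigmoid of that channel's global average."""
    gates = []
    for channel in feat:
        values = [v for row in channel for v in row]
        gates.append(sigmoid(sum(values) / len(values)))
    return gates


def spatial_gate(feat):
    """One weight per pixel: sigmoid of the mean across channels."""
    n_ch, height, width = len(feat), len(feat[0]), len(feat[0][0])
    return [
        [sigmoid(sum(feat[c][i][j] for c in range(n_ch)) / n_ch)
         for j in range(width)]
        for i in range(height)
    ]


def attend(feat):
    """Scale every value by its channel weight and its pixel weight."""
    cg = channel_gate(feat)
    sg = spatial_gate(feat)
    return [
        [[v * cg[c] * sg[i][j] for j, v in enumerate(row)]
         for i, row in enumerate(channel)]
        for c, channel in enumerate(feat)
    ]


def fuse_rgbd(rgb_feat, depth_feat):
    """Assumed fusion rule: element-wise sum of the two attended maps."""
    rgb_att = attend(rgb_feat)
    depth_att = attend(depth_feat)
    return [
        [[r + d for r, d in zip(r_row, d_row)]
         for r_row, d_row in zip(r_ch, d_ch)]
        for r_ch, d_ch in zip(rgb_att, depth_att)
    ]
```

In a real network the gates would normally be produced by small learned layers and applied at several encoder stages. The fixed sigmoid gates here only show how channel-wise and pixel-wise reweighting combine before the two modalities are merged.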