AsymFormer: Asymmetrical Cross-Modal Representation Learning for Mobile Platform Real-Time RGB-D Semantic Segmentation

In robotic intelligence, efficient and accurate RGB-D semantic segmentation is a cornerstone capability. State-of-the-art multimodal semantic segmentation methods, which are mostly built on symmetrical backbone networks, struggle to balance computational efficiency and accuracy. In this work, we propose AsymFormer, a novel network for real-time RGB-D semantic segmentation. It reduces superfluous parameters by optimizing the distribution of computational resources, and it introduces an asymmetrical backbone that allows multimodal features to be fused effectively. Furthermore, we explore techniques that improve accuracy by redefining feature selection and extracting multimodal self-similarity features without substantially increasing the parameter count, thereby preserving real-time execution on robotic platforms. Specifically, a Local Attention-Guided Feature Selection (LAFS) module selectively fuses features from different modalities by leveraging their dependencies, and a Cross-Modal Attention-Guided Feature Correlation Embedding (CMA) module then extracts further cross-modal representations. We evaluate the method on the NYUv2 and SUNRGBD datasets, where AsymFormer achieves competitive results of 54.1% mIoU on NYUv2 and 49.1% mIoU on SUNRGBD. Notably, AsymFormer runs at 65 FPS on an RTX 3090, and at 79 FPS after mixed-precision quantization. This speed is significantly higher than that of existing multimodal methods, demonstrating that AsymFormer can balance high accuracy and efficiency for RGB-D semantic segmentation.
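For readers unfamiliar with the mIoU figures reported above, the metric can be sketched as follows. This is a minimal illustration of the standard mean intersection-over-union computation over flattened per-pixel labels, not the evaluation code used for AsymFormer. The function names, the ignore label `255`, and the choice to skip classes that never appear are assumptions made for this example.

```python
from typing import List, Sequence


def confusion_matrix(pred: Sequence[int], target: Sequence[int],
                     num_classes: int, ignore_index: int = 255) -> List[List[int]]:
    """Count pixels per (ground-truth class, predicted class) pair.

    Rows index the ground-truth class, columns the predicted class.
    Pixels whose ground-truth label equals ignore_index are skipped
    (255 is an assumed convention, not taken from the paper).
    """
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        matrix[t][p] += 1
    return matrix


def mean_iou(matrix: List[List[int]]) -> float:
    """Average the per-class IoU = TP / (TP + FP + FN).

    Classes that appear in neither the prediction nor the ground truth
    are left out of the average (an assumption of this sketch).
    """
    n = len(matrix)
    ious = []
    for c in range(n):
        tp = matrix[c][c]
        fn = sum(matrix[c]) - tp                        # class c pixels predicted as another class
        fp = sum(matrix[r][c] for r in range(n)) - tp   # other pixels predicted as class c
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six pixels, three classes, last pixel unlabeled.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 255]
score = mean_iou(confusion_matrix(pred, target, num_classes=3))
print(f"mIoU = {score:.4f}")
```

In the toy example the per-class IoUs are 1/2, 2/3, and 1, so the mean is 13/18. On a real benchmark the confusion matrix would be accumulated over every image in the test split before the per-class IoUs are averaged.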
