AsymFormer: Asymmetrical Cross-Modal Representation Learning for Mobile Platform Real-Time RGB-D Semantic Segmentation

Efficient and precise RGB-D semantic segmentation is a cornerstone of robotic intelligence. State-of-the-art multimodal semantic segmentation methods, which are primarily built on symmetrical backbone networks, struggle to balance computational efficiency and accuracy. In this work, we propose AsymFormer, a novel network for real-time RGB-D semantic segmentation. It reduces superfluous parameters by optimizing the distribution of computational resources, and it introduces an asymmetrical backbone that allows effective fusion of multimodal features. Furthermore, we explore techniques that improve network accuracy by redefining feature selection and extracting multimodal self-similarity features without a substantial increase in parameter count, thereby preserving real-time execution on robotic platforms. Specifically, a Local Attention-Guided Feature Selection (LAFS) module selectively fuses features from different modalities by leveraging their dependencies, and a Cross-Modal Attention-Guided Feature Correlation Embedding (CMA) module then extracts further cross-modal representations. We evaluate the method on the NYUv2 and SUNRGBD datasets, where AsymFormer achieves competitive results: 52.0% mIoU on NYUv2 and 49.1% mIoU on SUNRGBD. Notably, AsymFormer reaches an inference speed of 65 FPS on an RTX 3090, rising to 79 FPS after mixed-precision quantization. This significantly outperforms existing multimodal methods, demonstrating that AsymFormer balances high accuracy and efficiency for RGB-D semantic segmentation.
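The accuracy figures above are reported as mIoU (mean intersection over union). As a reading aid, the following is a minimal sketch of how that metric is commonly computed from per-pixel class labels. The function name `mean_iou`, the `ignore_label` parameter, and the choice to skip classes that appear in neither prediction nor ground truth are illustrative assumptions, not details taken from the AsymFormer evaluation code.

```python
def mean_iou(pred, target, num_classes, ignore_label=None):
    """Mean intersection-over-union over flat sequences of class ids.

    Hypothetical helper for illustration; the paper's exact evaluation
    protocol (ignored labels, handling of absent classes) is assumed.
    """
    # conf[t][p] counts pixels whose ground-truth class is t and
    # whose predicted class is p.
    conf = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        if t == ignore_label:
            continue  # unlabeled pixels do not contribute
        conf[t][p] += 1

    ious = []
    for c in range(num_classes):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp  # class c pixels predicted as something else
        fp = sum(conf[r][c] for r in range(num_classes)) - tp  # other pixels predicted as c
        union = tp + fp + fn
        if union > 0:  # assumption: skip classes absent from both pred and target
            ious.append(tp / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three classes and six pixels.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(mean_iou(pred, target, num_classes=3))
```

In this toy case the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is approximately 0.5. In practice the confusion matrix is accumulated over the whole test set before the per-class ratios are taken, rather than averaged image by image.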
