AsymFormer: Asymmetrical Cross-Modal Representation Learning for Mobile Platform Real-Time RGB-D Semantic Segmentation

Efficient and precise RGB-D semantic segmentation is a cornerstone of robotic intelligence. State-of-the-art multimodal semantic segmentation methods, which are mostly built on symmetrical skeleton networks, struggle to balance computational efficiency and accuracy. In this work, we propose AsymFormer, a novel network for real-time RGB-D semantic segmentation. AsymFormer reduces superfluous parameters by optimizing the distribution of computational resources, and it introduces an asymmetrical backbone that allows effective fusion of multimodal features. Furthermore, we explore techniques that improve accuracy by redefining feature selection and extracting multimodal self-similarity features without substantially increasing the parameter count, so that the network can still run in real time on robotic platforms. Specifically, a Local Attention-Guided Feature Selection (LAFS) module selectively fuses features from the different modalities by leveraging their dependencies, and a Cross-Modal Attention-Guided Feature Correlation Embedding (CMA) module then extracts further cross-modal representations. We evaluate the method on the NYUv2 and SUNRGBD datasets, where AsymFormer achieves competitive results of 52.0% mIoU on NYUv2 and 49.1% mIoU on SUNRGBD. Notably, AsymFormer reaches an inference speed of 65 FPS on an RTX 3090, rising to 79 FPS after mixed-precision quantization. These speeds significantly outperform existing multimodal methods, demonstrating that AsymFormer can balance high accuracy with efficiency in RGB-D semantic segmentation.
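The accuracy figures above are reported as mean Intersection over Union (mIoU). As a reading aid, the following is a minimal sketch of how mIoU is commonly computed from flattened label maps. It is not the authors' evaluation code: the function names, the `ignore_index` value of 255, and the choice to skip classes that never appear are assumptions made for illustration.

```python
from typing import List, Sequence


def confusion_matrix(
    pred: Sequence[int],
    target: Sequence[int],
    num_classes: int,
    ignore_index: int = 255,  # assumed "unlabeled" marker, not taken from the paper
) -> List[List[int]]:
    """Count pixels per (ground-truth class, predicted class) pair.

    Rows index the ground-truth class, columns the predicted class.
    Both inputs are flattened label maps of equal length.
    """
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # unlabeled pixels do not contribute to the score
        matrix[t][p] += 1
    return matrix


def mean_iou(matrix: List[List[int]]) -> float:
    """Average the per-class IoU = TP / (TP + FP + FN) over observed classes."""
    num_classes = len(matrix)
    ious = []
    for c in range(num_classes):
        tp = matrix[c][c]
        fn = sum(matrix[c]) - tp  # pixels of class c predicted as another class
        fp = sum(matrix[r][c] for r in range(num_classes)) - tp  # other classes predicted as c
        denom = tp + fp + fn
        if denom == 0:
            continue  # class absent from both prediction and ground truth
        ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    # Toy example: six pixels, three classes, last pixel unlabeled.
    pred = [0, 0, 1, 1, 2, 2]
    target = [0, 1, 1, 1, 2, 255]
    print(mean_iou(confusion_matrix(pred, target, num_classes=3)))
```

In practice the confusion matrix is accumulated over every image in the test set before the per-class IoU values are averaged, so that classes that are rare in individual images are still scored on all of their pixels.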
