AsymFormer: Asymmetrical Cross-Modal Representation Learning for Mobile Platform Real-Time RGB-D Semantic Segmentation

In robotic intelligence, efficient and accurate RGB-D semantic segmentation is a cornerstone capability. State-of-the-art multimodal semantic segmentation methods, which mostly rely on symmetrical backbone networks, struggle to balance computational efficiency and accuracy. In this work, we propose AsymFormer, a novel network for real-time RGB-D semantic segmentation. It minimizes superfluous parameters by optimizing the distribution of computational resources, and it introduces an asymmetrical backbone that allows effective fusion of multimodal features. We further explore techniques that improve network accuracy by redefining feature selection and extracting multimodal self-similarity features without substantially increasing the parameter count, thereby ensuring real-time execution on robotic platforms. Specifically, a Local Attention-Guided Feature Selection (LAFS) module selectively fuses features from different modalities by leveraging their dependencies, and a Cross-Modal Attention-Guided Feature Correlation Embedding (CMA) module then further extracts cross-modal representations. We evaluate the method on the NYUv2 and SUNRGBD datasets, where AsymFormer achieves competitive results of 52.0% mIoU on NYUv2 and 49.1% mIoU on SUNRGBD. Notably, AsymFormer reaches an inference speed of 65 FPS on an RTX 3090, and 79 FPS after mixed-precision quantization. This significantly outperforms existing multimodal methods, demonstrating that AsymFormer can balance high accuracy and efficiency for RGB-D semantic segmentation.
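For readers unfamiliar with the reported metric, the following is a minimal sketch of how mean Intersection-over-Union (mIoU) is commonly computed from a confusion matrix. It is an illustration, not the evaluation code behind the numbers above. The function names, the `ignore_index=255` convention for unlabeled pixels, and the choice to skip classes absent from both prediction and ground truth are assumptions made for this example.

```python
import numpy as np


def confusion_matrix(pred, target, num_classes, ignore_index=255):
    """Accumulate a (num_classes x num_classes) matrix; rows = ground truth, cols = prediction.

    Assumption: pixels whose label equals `ignore_index` are unlabeled and excluded,
    and all remaining labels and predictions lie in [0, num_classes).
    """
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    valid = target != ignore_index
    # Encode each (target, pred) pair as a single index, then count occurrences.
    idx = target[valid].astype(np.int64) * num_classes + pred[valid].astype(np.int64)
    counts = np.bincount(idx, minlength=num_classes * num_classes)
    return counts.reshape(num_classes, num_classes)


def mean_iou(conf):
    """Average per-class IoU = TP / (TP + FP + FN) over classes that occur at all."""
    tp = np.diag(conf).astype(np.float64)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    present = union > 0  # assumption: skip classes absent from both pred and target
    iou = tp[present] / union[present]
    return float(iou.mean()) if iou.size else float("nan")


if __name__ == "__main__":
    # Toy example: four pixels, three classes, last pixel unlabeled.
    pred = np.array([0, 0, 1, 1])
    target = np.array([0, 1, 1, 255])
    conf = confusion_matrix(pred, target, num_classes=3)
    print(mean_iou(conf))
```

In practice the confusion matrix is summed over every image in the test split before `mean_iou` is called once, so that each class's IoU reflects the whole dataset and not an average of per-image scores.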
