Recently, integrating the local modeling capabilities of Convolutional Neural Networks (CNNs) with the global dependency modeling strengths of Transformers has attracted considerable attention in the semantic segmentation community. However, substantial computational workloads and high hardware memory demands remain major obstacles to their further application in real-time scenarios. In this work, we propose a Lightweight Multiple-Information Interaction Network (LMIINet) for real-time semantic segmentation, which effectively combines CNNs and Transformers while reducing redundant computation and memory footprint. It features Lightweight Feature Interaction Bottleneck (LFIB) modules comprising efficient convolutions that enhance context integration. Additionally, improvements are made to the Flatten Transformer by enhancing local and global feature interaction to capture detailed semantic information. Incorporating a combination coefficient learning scheme in both the LFIB and Transformer blocks facilitates improved feature interaction. Extensive experiments demonstrate that LMIINet excels in balancing accuracy and efficiency. With only 0.72M parameters and 11.74G FLOPs (Floating Point Operations), LMIINet achieves 72.0\% mIoU (mean Intersection over Union) at 100 FPS (Frames Per Second) on the Cityscapes test set and 69.94\% mIoU at 160 FPS on the CamVid test set using a single RTX2080Ti GPU.