Over the past decade, convolutional neural networks (CNNs) have become prominent in semantic segmentation. Although CNN models perform impressively, their ability to capture global representations remains insufficient, which leads to suboptimal results. Transformers have recently achieved great success in NLP tasks, demonstrating their advantage in modeling long-range dependencies. They have also attracted considerable attention from computer vision researchers, who reformulate image processing tasks as sequence-to-sequence prediction, but this reformulation degrades local feature details. In this work, we propose a lightweight real-time semantic segmentation network called LETNet. LETNet combines a U-shaped CNN with a Transformer in a capsule embedding style so that each compensates for the other's deficiencies. In addition, the carefully designed Lightweight Dilated Bottleneck (LDB) module and Feature Enhancement (FE) module both benefit training from scratch. Extensive experiments on challenging datasets demonstrate that LETNet achieves a superior balance between accuracy and efficiency. Specifically, it contains only 0.95M parameters and requires 13.6G FLOPs, yet yields 72.8\% mIoU at 120 FPS on the Cityscapes test set and 70.5\% mIoU at 250 FPS on the CamVid test set using a single RTX 3090 GPU. The source code will be available at https://github.com/IVIPLab/LETNet.
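The accuracy figures in the abstract are reported as mIoU (mean intersection over union). As a reading aid, the metric can be sketched in plain Python as below. This is a generic illustration, not the authors' evaluation code: the function name `mean_iou`, the flat label lists, and the `ignore_index=255` default are assumptions made for the example, and benchmark suites typically accumulate the counts over the whole test set rather than per image.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over the classes that occur in pred or target.

    pred, target: flat sequences of integer class labels of equal length.
    ignore_index: target label to skip (assumed convention, e.g. void pixels).
    """
    inter = [0] * num_classes         # pixels where pred == target == c
    pred_count = [0] * num_classes    # pixels predicted as c
    target_count = [0] * num_classes  # pixels labeled as c

    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # ignored pixels contribute to no class
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            inter[t] += 1

    ious = []
    for c in range(num_classes):
        union = pred_count[c] + target_count[c] - inter[c]
        if union > 0:  # skip classes absent from both pred and target
            ious.append(inter[c] / union)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    # Toy example with two classes and four pixels.
    print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

In the toy call, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is their average.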