Convolutional neural networks (CNNs) and Transformers have achieved high accuracy in crack detection under certain conditions. However, the fixed local receptive fields of CNNs can compromise their generalisation, and the quadratic complexity of global self-attention restricts the practical deployment of Transformers. Motivated by the emergence of the new-generation Mamba architecture, this paper proposes a Vision Mamba (VMamba)-based framework for crack segmentation on concrete, asphalt, and masonry surfaces that offers high accuracy, strong generalisation, and low computational complexity. With 15.6%–74.5% fewer parameters, the VMamba-integrated encoder-decoder network achieved up to 2.8% higher mean Dice score (mDS) than representative CNN-based models while performing on par with Transformer-based models. Moreover, the VMamba-based encoder-decoder network could process high-resolution image inputs with up to 90.6% fewer floating-point operations.