Explainable Flood Segmentation on Sentinel-1 SAR Imagery: A Comparative Study of CNN and Transformer Architectures

Rapid and accurate flood prediction is essential for disaster response and mitigation planning. Satellite-borne Synthetic Aperture Radar (SAR) sensors are well suited to this purpose because they operate independently of weather and daylight conditions. Although SAR data enable all-weather flood monitoring, distinguishing flooded land from permanent water remains a significant challenge, particularly when flooding is defined strictly as inundated land. This study provides a comprehensive comparison of convolutional neural network (CNN) and vision transformer architectures for multi-class flood segmentation on Sentinel-1 SAR imagery, with models trained specifically to separate flooded land from permanent water bodies and land. Three state-of-the-art (SOTA) CNN-based models, U-Net, U-Net++, and DeepLabV3 with a ResNet-34 backbone, and three SegFormer variants (b0, b1, b2) were evaluated on two benchmark datasets, the ETCI NASA dataset and Sen1Floods11, using scene-based data splits to ensure a realistic assessment of spatial generalization. The results demonstrate that SegFormer-b2 significantly outperforms the U-Net baseline on the ETCI dataset (higher flood IoU on all 7 test scenes, Wilcoxon signed-rank test), whereas after fine-tuning on Sen1Floods11 the advantage narrows to within the range of scene-to-scene variability and is concentrated in spatially fragmented flood events. The study also applies both qualitative and quantitative explainability techniques to visualize model decisions and systematically assess prediction reliability. Qualitative analysis reveals that SegFormer-b2 produces more spatially coherent Grad-CAM activations focused on flood-relevant features, while U-Net generates more informative uncertainty estimates along flood boundaries.
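As an illustration of the per-scene comparison described above, the following is a minimal sketch of how a flood-class IoU and an exact paired Wilcoxon signed-rank test could be computed. It is not the authors' code: the label encoding (`FLOOD = 1`), the function names, and the per-scene scores are all hypothetical and chosen only for demonstration. The test is written with the standard library alone and enumerates every sign pattern, which is feasible only for a small number of scenes.

```python
from itertools import product

# Hypothetical label encoding (an assumption, not taken from the study):
# 0 = land, 1 = flooded land, 2 = permanent water.
FLOOD = 1


def class_iou(pred, target, cls=FLOOD):
    """IoU of one class over two equally long flat label sequences."""
    inter = sum(p == cls and t == cls for p, t in zip(pred, target))
    union = sum(p == cls or t == cls for p, t in zip(pred, target))
    return inter / union if union else float("nan")


def average_ranks(values):
    """Ranks starting at 1; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def wilcoxon_exact(a, b):
    """Exact two-sided Wilcoxon signed-rank test on paired scores.

    Returns (W_plus, p_value). Zero differences are dropped. All 2**n
    sign patterns are enumerated, so keep n small.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    if not diffs:
        return 0.0, 1.0
    ranks = average_ranks([abs(d) for d in diffs])
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = abs(w_plus - total / 2)
    extreme = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - total / 2) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** len(ranks)


# Invented per-scene flood IoU values for two models on 7 scenes.
model_a = [0.62, 0.55, 0.71, 0.48, 0.66, 0.59, 0.70]
model_b = [0.60, 0.50, 0.62, 0.47, 0.59, 0.55, 0.64]

w_plus, p_value = wilcoxon_exact(model_a, model_b)
print(w_plus, p_value)  # 28.0 0.015625
```

When one model scores higher on all seven scenes, only 2 of the 2^7 = 128 sign patterns are as extreme as the observed one, so the exact two-sided p-value is 2/128 ≈ 0.0156. This is the smallest value attainable with seven paired scenes, which is worth keeping in mind when interpreting significance at this sample size.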
