Aerial image segmentation is semantic segmentation from a top-down perspective. It poses several challenges: a strong imbalance in the foreground-background distribution, complex backgrounds, intra-class heterogeneity, inter-class homogeneity, and tiny objects. To handle these problems, we build on the advantages of Transformers and propose AerialFormer, which combines a Transformer in the contracting path with lightweight Multi-Dilated Convolutional Neural Networks (MD-CNNs) in the expanding path. AerialFormer has a hierarchical structure in which the Transformer encoder outputs multi-scale features and the MD-CNNs decoder aggregates information across those scales. It therefore takes both local and global context into account, yielding strong representations and high-resolution segmentation. We benchmark AerialFormer on three common datasets: iSAID, LoveDA, and Potsdam. Comprehensive experiments and extensive ablation studies show that AerialFormer outperforms previous state-of-the-art methods. Our source code will be made publicly available upon acceptance.
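To make the idea of a multi-dilated convolution concrete, the following is a minimal sketch in plain Python, not the AerialFormer implementation. It assumes a single-channel input, one shared 3x3 kernel, zero padding, and dilation rates of 1, 2, and 3; the function names `dilated_conv2d` and `multi_dilated_block`, the all-ones kernel, and the dilation rates are hypothetical choices made for illustration, and the abstract does not specify the actual configuration. The sketch only shows why parallel branches with different dilation rates see different amounts of context at the same parameter cost.

```python
from typing import List, Sequence

Grid = List[List[float]]


def dilated_conv2d(image: Grid, kernel: Grid, dilation: int) -> Grid:
    """Single-channel 2D convolution (cross-correlation) with zero padding.

    Kernel taps are spaced `dilation` pixels apart, so a 3x3 kernel covers a
    window of side (2 * dilation + 1) while still using nine weights.
    The output has the same height and width as the input.
    """
    height, width = len(image), len(image[0])
    k = len(kernel)  # assumption: square kernel with odd side length
    half = k // 2
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    yy = y + (i - half) * dilation
                    xx = x + (j - half) * dilation
                    # Positions outside the image contribute zero.
                    if 0 <= yy < height and 0 <= xx < width:
                        acc += kernel[i][j] * image[yy][xx]
            out[y][x] = acc
    return out


def multi_dilated_block(image: Grid, kernel: Grid,
                        dilations: Sequence[int]) -> List[Grid]:
    """Run the input through parallel branches, one per dilation rate.

    Returns one feature map per branch, i.e. a channel-wise concatenation.
    """
    return [dilated_conv2d(image, kernel, d) for d in dilations]


if __name__ == "__main__":
    # 7x7 image with a single bright pixel at the centre.
    image = [[0.0] * 7 for _ in range(7)]
    image[3][3] = 1.0
    box = [[1.0] * 3 for _ in range(3)]  # hypothetical all-ones kernel
    rates = (1, 2, 3)                    # hypothetical dilation rates
    for d, fmap in zip(rates, multi_dilated_block(image, box, rates)):
        reach = max(abs(y - 3) for y in range(7) for x in range(7)
                    if fmap[y][x] != 0.0)
        print(f"dilation={d}: response reaches {reach} pixel(s) from the centre")
```

In this toy setting each branch responds to the centre pixel from a distance equal to its dilation rate, so stacking the branches mixes fine local detail with wider context. A real decoder would use learned multi-channel kernels in a deep learning framework rather than nested loops.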