Aerial image segmentation is semantic segmentation from a top-down perspective. It poses several challenges: a strongly imbalanced foreground-background distribution, complex backgrounds, intra-class heterogeneity, inter-class homogeneity, and tiny objects. To address these problems, we build on the strengths of Transformers and propose AerialFormer, which combines a Transformer in the contracting path with lightweight Multi-Dilated Convolutional Neural Networks (MD-CNNs) in the expanding path. AerialFormer has a hierarchical structure in which the Transformer encoder outputs multi-scale features and the MD-CNN decoder aggregates information across those scales. It thus takes both local and global context into account to produce powerful representations and high-resolution segmentation. We benchmark AerialFormer on three common datasets: iSAID, LoveDA, and Potsdam. Comprehensive experiments and extensive ablation studies show that AerialFormer outperforms previous state-of-the-art methods. Our source code will be made publicly available upon acceptance.
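To make the multi-dilated idea concrete, the following is a minimal one-dimensional sketch in plain Python. It is not the AerialFormer decoder: the abstract does not state the dilation rates, kernel sizes, or how parallel branches are merged, so the rates `(1, 2, 3)`, the 3-tap kernel, and the function names `dilated_conv1d`, `receptive_field`, and `multi_dilated_block` are all illustrative assumptions. The sketch only shows how parallel branches with different dilation rates see local and wider context with the same number of weights.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Zero-padded, same-length 1D cross-correlation with `dilation` spacing between taps."""
    center = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            idx = i + (j - center) * dilation
            if 0 <= idx < len(signal):  # positions outside the signal count as zero
                acc += weight * signal[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilation):
    """Number of input positions spanned by one dilated kernel."""
    return (kernel_size - 1) * dilation + 1


def multi_dilated_block(signal, kernel, dilations=(1, 2, 3)):
    """Assumed layout: one parallel branch per dilation rate, outputs kept as separate channels."""
    return [dilated_conv1d(signal, kernel, d) for d in dilations]


# A single impulse reveals which input positions each branch can reach.
impulse = [0, 0, 0, 0, 1, 0, 0, 0, 0]
kernel = [1.0, 1.0, 1.0]
branches = multi_dilated_block(impulse, kernel)
for rate, branch in zip((1, 2, 3), branches):
    print(rate, receptive_field(len(kernel), rate), branch)
```

With the same three weights, the branch with dilation 1 covers 3 neighbouring positions while the branch with dilation 3 spans 7, which is the sense in which a multi-dilated block can mix local detail with wider context at low cost. A real decoder would use 2D convolutions with learned weights and a learned fusion of the branches.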