HiT: Building Mapping with Hierarchical Transformers

Deep learning-based methods for automatic building mapping from high-resolution remote sensing images have been explored extensively in recent years. While most building mapping models produce vector polygons of buildings for geographic and mapping systems, dominant methods typically decompose polygonal building extraction into several sub-problems (segmentation, polygonization, and regularization), which leads to complex inference procedures, low accuracy, and poor generalization. In this paper, we propose HiT, a simple and novel building mapping method based on Hierarchical Transformers that improves the quality of polygonal building mapping from high-resolution remote sensing images. HiT builds on a two-stage detection architecture by adding a polygon head in parallel with the classification and bounding box regression heads. The model is fully end-to-end trainable and simultaneously outputs building bounding boxes and vector polygons. The polygon head formulates a building polygon as a sequence of serialized vertices with a bidirectional characteristic, a simple and elegant representation that avoids any assumption about the start or end vertex. Under this formulation, the polygon head adopts a transformer encoder-decoder architecture to predict the serialized vertices, supervised by a purpose-designed bidirectional polygon loss. Furthermore, a hierarchical attention mechanism combined with convolution operations is introduced in the encoder of the polygon head to capture richer geometric structure of building polygons at the vertex and edge levels. Comprehensive experiments on two benchmarks (the CrowdAI and Inria datasets) demonstrate that our method achieves a new state of the art in both instance segmentation and polygonal metrics. Moreover, qualitative results show that the model remains effective in complex scenes.
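To make the idea of a polygon representation without a start or end vertex hypothesis more concrete, the following minimal Python sketch scores a predicted vertex sequence against every cyclic shift of the ground-truth polygon in both traversal directions and keeps the smallest distance. This is only an illustration under stated assumptions: the abstract does not define the bidirectional polygon loss, so the min-over-orderings L1 formulation, the equal-length requirement, and the names `l1_distance`, `orderings`, and `invariant_polygon_loss` are hypothetical and should not be read as the authors' implementation.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def l1_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Mean per-vertex L1 distance between two equal-length vertex sequences."""
    total = sum(abs(ax - bx) + abs(ay - by) for (ax, ay), (bx, by) in zip(a, b))
    return total / len(a)


def orderings(polygon: Sequence[Point]) -> List[List[Point]]:
    """All cyclic shifts of a closed polygon, in both traversal directions."""
    forward = list(polygon)
    backward = forward[::-1]
    n = len(forward)
    return [seq[k:] + seq[:k] for seq in (forward, backward) for k in range(n)]


def invariant_polygon_loss(pred: Sequence[Point], target: Sequence[Point]) -> float:
    """Illustrative loss (an assumption, not the paper's definition).

    Takes the smallest mean L1 distance between `pred` and any ordering of
    `target`, so the value does not depend on which vertex the target starts
    from or on whether it is listed clockwise or counter-clockwise.
    """
    if not pred or len(pred) != len(target):
        raise ValueError("pred and target must be non-empty and of equal length")
    return min(l1_distance(pred, candidate) for candidate in orderings(target))


if __name__ == "__main__":
    square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
    # Same square, listed in the opposite direction from a different start vertex.
    same_square = [(1.0, 1.0), (1.0, 0.0), (0.0, 0.0), (0.0, 1.0)]
    print("naive L1:", l1_distance(same_square, square))
    print("invariant loss:", invariant_polygon_loss(same_square, square))
```

In this toy case the naive comparison penalizes a geometrically identical polygon only because its vertices are listed in a different order, while the ordering-invariant variant returns zero. A real training loss would also need to handle differing vertex counts and be written with differentiable tensor operations, which this sketch leaves out.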
