HiT: Building Mapping with Hierarchical Transformers

Deep learning-based methods have been extensively explored in recent years for automatic building mapping from high-resolution remote sensing images. Most building mapping models produce vector polygons of buildings for geographic and mapping systems, but dominant methods typically decompose polygonal building extraction into several sub-problems, including segmentation, polygonization, and regularization, which leads to complex inference procedures, low accuracy, and poor generalization. In this paper, we propose HiT, a simple and novel building mapping method with Hierarchical Transformers that improves polygonal building mapping quality from high-resolution remote sensing images. HiT builds on a two-stage detection architecture by adding a polygon head in parallel with the classification and bounding box regression heads. It simultaneously outputs building bounding boxes and vector polygons and is fully end-to-end trainable. The polygon head formulates a building polygon as a sequence of serialized vertices with a bidirectional property, a simple and elegant polygon representation that avoids any assumption about the start or end vertex. Under this formulation, the polygon head adopts a transformer encoder-decoder architecture to predict the serialized vertices, supervised by a purpose-designed bidirectional polygon loss. Furthermore, a hierarchical attention mechanism combined with convolution operations is introduced in the encoder of the polygon head, capturing richer geometric structure of building polygons at the vertex and edge levels. Comprehensive experiments on two benchmarks (the CrowdAI and Inria datasets) demonstrate that our method outperforms previous state-of-the-art methods on both instance segmentation and polygonal metrics. Moreover, qualitative results confirm the effectiveness of our model in complex scenes.
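
To illustrate what a polygon representation without a start or end vertex assumption implies for supervision, the following minimal sketch scores a predicted vertex sequence against a ground-truth ring under every cyclic shift and both traversal directions, and keeps the smallest error. This is an assumption made for illustration only: the abstract does not give the formula of the bidirectional polygon loss, and the names `l1_distance` and `order_invariant_polygon_loss`, the mean L1 distance, and the equal-length requirement are all hypothetical choices rather than the authors' method.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def l1_distance(a: List[Point], b: List[Point]) -> float:
    """Mean L1 distance between two equal-length vertex sequences."""
    total = sum(abs(ax - bx) + abs(ay - by) for (ax, ay), (bx, by) in zip(a, b))
    return total / len(a)


def order_invariant_polygon_loss(pred: List[Point], target: List[Point]) -> float:
    """Smallest mean L1 distance over all start vertices and both directions.

    Hypothetical stand-in for a bidirectional polygon loss, not the paper's formula.
    """
    if not pred or len(pred) != len(target):
        raise ValueError("pred and target must be non-empty and of equal length")
    n = len(target)
    best = float("inf")
    # Try the ground-truth ring in both traversal directions.
    for ring in (target, target[::-1]):
        # Try every vertex of the ring as the starting vertex.
        for shift in range(n):
            candidate = ring[shift:] + ring[:shift]
            best = min(best, l1_distance(pred, candidate))
    return best


if __name__ == "__main__":
    square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
    # Same square, traversed in the opposite direction from a different start vertex.
    same_square = [(1.0, 1.0), (1.0, 0.0), (0.0, 0.0), (0.0, 1.0)]
    print(order_invariant_polygon_loss(same_square, square))
```

Under this sketch, two sequences that describe the same ring receive zero loss regardless of where the sequence starts or which way it is traversed, which is the property the bidirectional formulation is meant to provide. The exhaustive search over shifts is quadratic in the number of vertices and is intended only to make the invariance explicit.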
