Graph-based Semantic Calibration Network for Unaligned UAV RGBT Image Semantic Segmentation and A Large-scale Benchmark

Fine-grained RGBT image semantic segmentation is crucial for all-weather unmanned aerial vehicle (UAV) scene understanding. However, UAV RGBT semantic segmentation faces two coupled challenges: cross-modal spatial misalignment caused by sensor parallax and platform vibration, and severe semantic confusion among fine-grained ground objects under top-down aerial views. To address these issues, we propose a Graph-based Semantic Calibration Network (GSCNet) for unaligned UAV RGBT image semantic segmentation. Specifically, we design a Feature Decoupling and Alignment Module (FDAM) that decouples each modality into shared structural and private perceptual components and performs deformable alignment in the shared subspace, so that spatial correction is robust and less affected by modality-specific appearance. Moreover, we propose a Semantic Graph Calibration Module (SGCM) that explicitly encodes the hierarchical taxonomy and co-occurrence regularities among ground-object categories in UAV scenes into a structured category graph, and incorporates these priors into graph-attention reasoning to calibrate predictions for visually similar and rare categories. In addition, we construct the Unaligned RGB-Thermal Fine-grained (URTF) benchmark, which is, to the best of our knowledge, the largest and most fine-grained benchmark for unaligned UAV RGBT image semantic segmentation, containing over 25,000 image pairs across 61 categories with realistic cross-modal misalignment. Extensive experiments on URTF demonstrate that GSCNet significantly outperforms state-of-the-art methods, with notable gains on fine-grained categories. The dataset is available at https://github.com/mmic-lcl/Datasets-and-benchmark-code.
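To make the idea behind SGCM more concrete, the following is a minimal sketch of how a category graph built from taxonomy and co-occurrence priors could bias attention among category embeddings. It is not the GSCNet implementation: the abstract gives no architectural details, so the function names (`build_prior`, `attention_weights`, `calibrate`), the toy categories, the weights, and the additive-bias form of the attention are all assumptions chosen for illustration.

```python
import math

def build_prior(categories, parent, cooccur, w_sibling=1.0, w_cooccur=1.0):
    """Dense prior matrix: prior[i][j] scores how related categories i and j are.

    Hypothetical construction: siblings in the taxonomy get w_sibling, and
    co-occurring pairs get an extra w_cooccur * co-occurrence strength.
    """
    n = len(categories)
    prior = [[0.0] * n for _ in range(n)]
    for i, a in enumerate(categories):
        for j, b in enumerate(categories):
            if i == j:
                continue
            if parent[a] == parent[b]:
                prior[i][j] += w_sibling
            prior[i][j] += w_cooccur * cooccur.get(frozenset((a, b)), 0.0)
    return prior

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(embeddings, prior):
    """Scaled dot-product attention between categories, biased by the prior."""
    d = len(embeddings[0])
    weights = []
    for i, q in enumerate(embeddings):
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d) + prior[i][j]
            for j, k in enumerate(embeddings)
        ]
        weights.append(softmax(scores))
    return weights

def calibrate(embeddings, prior):
    """One round of message passing: each category embedding becomes an
    attention-weighted average of all category embeddings."""
    weights = attention_weights(embeddings, prior)
    d = len(embeddings[0])
    return [
        [sum(w * e[k] for w, e in zip(row, embeddings)) for k in range(d)]
        for row in weights
    ]

# Toy example (all values are illustrative, not taken from URTF).
categories = ["car", "truck", "road", "tree"]
parent = {"car": "vehicle", "truck": "vehicle",
          "road": "surface", "tree": "vegetation"}
cooccur = {frozenset(("car", "road")): 0.5}
embeddings = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [-1.0, 0.5]]

prior = build_prior(categories, parent, cooccur)
zero_prior = [[0.0] * len(categories) for _ in categories]
with_prior = attention_weights(embeddings, prior)
without_prior = attention_weights(embeddings, zero_prior)
calibrated = calibrate(embeddings, prior)
```

In this toy setup, the prior raises the attention that "car" pays to its taxonomic sibling "truck" relative to plain dot-product attention, which is the kind of structured information sharing among visually similar categories that the abstract attributes to SGCM. A real model would learn the embeddings and apply the calibrated category representations to pixel-level predictions.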
