Land cover underpins ecosystem services, hydrologic regulation, disaster-risk reduction, and evidence-based land planning; timely, accurate land-cover maps are therefore critical for environmental stewardship. Remote sensing-based land-cover classification offers a scalable route to such maps but is hindered by scarce and imbalanced annotations and by geometric distortions in high-resolution scenes. We propose LC4-DViT (Land-cover Creation for Land-cover Classification with Deformable Vision Transformer), a framework that combines generative data creation with a deformation-aware Vision Transformer. A text-guided diffusion pipeline uses GPT-4o-generated scene descriptions and super-resolved exemplars to synthesize class-balanced, high-fidelity training images, while DViT couples a DCNv4 deformable convolutional backbone with a Vision Transformer encoder to jointly capture fine-scale geometry and global context. On eight classes from the Aerial Image Dataset (AID), namely Beach, Bridge, Desert, Forest, Mountain, Pond, Port, and River, DViT achieves 0.9572 overall accuracy, 0.9576 macro F1-score, and 0.9510 Cohen's Kappa, improving over a vanilla ViT baseline (0.9274 OA, 0.9300 macro F1, 0.9169 Kappa) and outperforming ResNet50, MobileNetV2, and FlashInternImage. Cross-dataset experiments on a three-class SIRI-WHU subset (Harbor, Pond, River) yield 0.9333 overall accuracy, 0.9316 macro F1, and 0.8989 Kappa, indicating good transferability. An LLM-based judge using GPT-4o to score Grad-CAM heatmaps further shows that DViT's attention aligns best with hydrologically meaningful structures. These results suggest that description-driven generative augmentation combined with deformation-aware transformers is a promising approach for high-resolution land-cover mapping.
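For readers unfamiliar with the three reported metrics, they can be computed directly from paired label lists; the following is a minimal illustrative sketch (not the paper's evaluation code), using plain Python rather than a metrics library:

```python
from collections import Counter

def overall_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (each class counts equally)."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

def cohens_kappa(y_true, y_pred):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    p_o = overall_accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Expected chance agreement from the marginal class frequencies.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

Because macro F1 averages per-class scores without frequency weighting, it is sensitive to errors on rare classes, which is why it is reported alongside overall accuracy for imbalanced land-cover data.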