Remote sensing image scene classification remains a challenging task, primarily due to the complex spatial structures and multi-scale characteristics of ground objects. Among existing approaches, CNNs excel at modeling local textures, while Transformers excel at capturing global context; however, efficiently integrating the two remains a bottleneck due to the high computational cost of Transformers. To tackle this, we propose AFM-Net, a novel Advanced Hierarchical Fusion framework that achieves effective local-global co-representation through two pathways: a CNN branch for extracting hierarchical visual priors, and a Mamba branch for efficient global sequence modeling. The core innovation of AFM-Net lies in its Hierarchical Fusion Mechanism, which progressively aggregates multi-scale features from both pathways, enabling dynamic cross-level feature interaction and contextual reconstruction to produce highly discriminative representations. The fused features are then adaptively routed through a Mixture-of-Experts classifier, which dispatches them to the most suitable experts for fine-grained scene recognition. Experiments on AID, NWPU-RESISC45, and UC Merced show that AFM-Net achieves accuracies of 93.72%, 95.54%, and 96.92%, respectively, surpassing state-of-the-art methods while balancing accuracy and efficiency. Code is available at https://github.com/tangyuanhao-qhu/AFM-Net.
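To make the described pipeline concrete, below is a minimal PyTorch sketch of the kind of dual-branch fusion and expert routing the abstract outlines. It is an illustrative assumption, not the released implementation (see the repository above): the class names and dimensions are hypothetical, and the Mamba branch is stood in by a simple gated convolutional sequence mixer so the snippet stays self-contained, where the real model would use a selective state-space block (e.g., from the mamba_ssm package).

```python
import torch
import torch.nn as nn

class SequenceMixerStub(nn.Module):
    """Stand-in for the Mamba branch: a causal depthwise conv with a
    sigmoid gate. A real AFM-Net-style model would use a selective
    state-space (Mamba) block here instead."""
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=4, padding=3, groups=dim)
        self.gate = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (B, L, C) token sequence
        # Causal conv over the sequence axis, trimmed back to length L.
        h = self.conv(x.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        return self.proj(h * torch.sigmoid(self.gate(x)))

class FusionBlock(nn.Module):
    """One level of a hierarchical fusion: concatenate CNN-branch and
    sequence-branch features at the same scale, then mix and normalize."""
    def __init__(self, dim):
        super().__init__()
        self.mix = nn.Linear(2 * dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, cnn_feat, seq_feat):  # both (B, L, C)
        return self.norm(self.mix(torch.cat([cnn_feat, seq_feat], dim=-1)))

class MoEHead(nn.Module):
    """Soft mixture-of-experts classifier: a router produces per-sample
    weights, and the expert logits are combined as a weighted sum."""
    def __init__(self, dim, num_classes, num_experts=4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Linear(dim, num_classes) for _ in range(num_experts)
        )

    def forward(self, x):  # x: (B, C) pooled feature
        weights = self.router(x).softmax(dim=-1)                    # (B, E)
        logits = torch.stack([e(x) for e in self.experts], dim=1)   # (B, E, K)
        return (weights.unsqueeze(-1) * logits).sum(dim=1)          # (B, K)

# Toy forward pass: 2 images, a 7x7 feature grid (49 tokens) of width 256,
# classified into 45 scene classes (the NWPU-RESISC45 label count).
B, L, C, K = 2, 49, 256, 45
cnn_tokens = torch.randn(B, L, C)            # flattened CNN feature map
seq_tokens = SequenceMixerStub(C)(cnn_tokens)
fused = FusionBlock(C)(cnn_tokens, seq_tokens)
print(MoEHead(C, K)(fused.mean(dim=1)).shape)  # torch.Size([2, 45])
```

In the full model this fusion would be applied at several scales with cross-level aggregation rather than at the single level shown here; the sketch only fixes the data flow: two same-scale token streams fused per level, then pooled and routed through experts.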