Hierarchical Fusion of Local and Global Visual Features with Mixture-of-Experts for Remote Sensing Image Scene Classification

Remote sensing image scene classification remains a challenging task, primarily due to the complex spatial structures and multi-scale characteristics of ground objects. CNN-based methods exploit local inductive biases to extract fine-grained local features, and Mamba-based approaches capture global sequential context efficiently, but relying on a single paradigm restricts a model's ability to characterize fine-grained textures and complex spatial structures at the same time. To address this, we propose a parallel heterogeneous encoder with a hierarchical fusion module, designed to achieve effective local-global co-representation. The encoder consists of two parallel pathways: a local visual encoder that extracts multi-scale local visual features, and a global visual encoder that efficiently captures global visual features. The core innovation is the hierarchical fusion module, which progressively aggregates multi-scale features from both pathways, enabling dynamic cross-level feature interaction and contextual reconstruction to produce highly discriminative representations. The fused features are then adaptively routed through a mixture-of-experts classifier head, which dynamically dispatches them to the most suitable experts for fine-grained scene recognition. Experiments on AID, NWPU-RESISC45, and UC Merced show that our model achieves 93.72%, 95.54%, and 96.92% accuracy, respectively, surpassing state-of-the-art methods while balancing accuracy and efficiency. Code is available at https://anonymous.4open.science/r/classification-41DF.
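
To make the routing step concrete, the following is a minimal sketch of a top-k gated mixture-of-experts classifier head, written in plain Python so that it runs without dependencies. It is an illustration under stated assumptions, not the implementation in the linked repository: the abstract does not specify the gating rule, the number of experts, or the expert architecture, so the class name `MoEHead`, the parameters `num_experts` and `top_k`, the linear experts, and the random initialisation are all hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, x):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def init_linear(rng, n_out, n_in):
    """Small random weights and zero biases (hypothetical initialisation)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class MoEHead:
    """Top-k gated mixture of linear experts (illustrative only)."""

    def __init__(self, dim, num_classes, num_experts=4, top_k=2, seed=0):
        if not 1 <= top_k <= num_experts:
            raise ValueError("top_k must lie between 1 and num_experts")
        rng = random.Random(seed)
        self.top_k = top_k
        self.num_classes = num_classes
        # The gate scores every expert from the fused feature vector.
        self.gate = init_linear(rng, num_experts, dim)
        # Each expert is assumed to be a single linear classifier.
        self.experts = [init_linear(rng, num_classes, dim)
                        for _ in range(num_experts)]

    def route(self, x):
        """Return (expert index, weight) pairs for the top-k experts."""
        gate_logits = linear(*self.gate, x)
        ranked = sorted(range(len(gate_logits)),
                        key=gate_logits.__getitem__, reverse=True)
        chosen = ranked[:self.top_k]
        # Renormalise the gate scores over the selected experts only.
        weights = softmax([gate_logits[i] for i in chosen])
        return list(zip(chosen, weights))

    def forward(self, x):
        """Class probabilities from the gate-weighted sum of expert logits."""
        logits = [0.0] * self.num_classes
        for index, weight in self.route(x):
            expert_logits = linear(*self.experts[index], x)
            logits = [acc + weight * value
                      for acc, value in zip(logits, expert_logits)]
        return softmax(logits)


# Usage: an 8-dimensional stand-in for a fused feature, 45 scene classes.
head = MoEHead(dim=8, num_classes=45)
feature = [0.5, -0.2, 0.1, 0.9, -0.7, 0.3, 0.0, 0.4]
probabilities = head.forward(feature)
selected = head.route(feature)
```

Only the `top_k` selected experts are evaluated for a given input, which is the usual reason a mixture-of-experts head can add capacity without a proportional increase in per-sample computation.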
