In this paper, we propose a novel multimodal framework, the Multimodal Language-Guided Network (MMLGNet), to align heterogeneous remote sensing modalities such as hyperspectral imaging (HSI) and LiDAR with natural language semantics using vision-language models such as CLIP. With the increasing availability of multimodal Earth observation data, there is a growing need for methods that effectively fuse spectral, spatial, and geometric information while enabling semantic-level understanding. MMLGNet employs modality-specific encoders and aligns visual features with handcrafted textual embeddings in a shared latent space via bi-directional contrastive learning. Inspired by CLIP's training paradigm, our approach bridges the gap between high-dimensional remote sensing data and language-guided interpretation. Notably, MMLGNet achieves strong performance with simple CNN-based encoders, outperforming several established multimodal vision-only methods on two benchmark datasets, demonstrating the significant benefit of language supervision. Code is available at https://github.com/AdityaChaudhary2913/CLIP_HSI.
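To make the alignment objective concrete, the following is a minimal sketch of a CLIP-style bi-directional contrastive setup between fused HSI/LiDAR features and fixed class-prompt embeddings. It is not the authors' implementation: the encoder sizes, concatenation fusion, prompt-embedding source, and all names below are illustrative assumptions (see the linked repository for the actual MMLGNet); pairing each sample with its class prompt also ignores repeated labels within a batch, a simplification kept here for brevity.

```python
# Illustrative sketch only (NOT the authors' code): CLIP-style bi-directional
# contrastive alignment of fused HSI + LiDAR patch features with fixed text
# embeddings. Layer sizes, fusion scheme, and prompt source are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchEncoder(nn.Module):
    """Simple CNN encoder for one modality (HSI or LiDAR patches)."""
    def __init__(self, in_channels: int, embed_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, x):
        return self.net(x)

class MMLGNetSketch(nn.Module):
    """Two modality-specific encoders, concatenation fusion, and a projection
    into the text-embedding space (all hypothetical design choices)."""
    def __init__(self, hsi_bands: int, lidar_channels: int,
                 text_embeds: torch.Tensor, embed_dim: int = 256):
        super().__init__()
        self.hsi_enc = PatchEncoder(hsi_bands, embed_dim)
        self.lidar_enc = PatchEncoder(lidar_channels, embed_dim)
        self.proj = nn.Linear(2 * embed_dim, text_embeds.shape[1])
        # Handcrafted class-prompt embeddings, e.g. from a frozen CLIP text encoder.
        self.register_buffer("text_embeds", F.normalize(text_embeds, dim=-1))
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~ln(1/0.07), as in CLIP

    def forward(self, hsi, lidar):
        fused = torch.cat([self.hsi_enc(hsi), self.lidar_enc(lidar)], dim=-1)
        return F.normalize(self.proj(fused), dim=-1)

def bidirectional_contrastive_loss(visual, text, logit_scale):
    """Symmetric (image-to-text and text-to-image) InfoNCE over matched pairs.
    visual, text: (B, D) L2-normalized embeddings for B matched pairs."""
    logits = logit_scale.exp() * visual @ text.t()            # (B, B) similarity
    targets = torch.arange(visual.size(0), device=visual.device)
    return 0.5 * (F.cross_entropy(logits, targets) +          # vision -> language
                  F.cross_entropy(logits.t(), targets))       # language -> vision

# Toy usage: 10 scene classes, 64-band HSI patches, 1-channel LiDAR patches.
num_classes = 10
text_embeds = torch.randn(num_classes, 512)  # stand-in for real prompt embeddings
model = MMLGNetSketch(hsi_bands=64, lidar_channels=1, text_embeds=text_embeds)
hsi = torch.randn(8, 64, 11, 11)
lidar = torch.randn(8, 1, 11, 11)
labels = torch.randint(0, num_classes, (8,))
v = model(hsi, lidar)
loss = bidirectional_contrastive_loss(v, model.text_embeds[labels], model.logit_scale)
loss.backward()
```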