While integrating multiple modalities has the potential to improve environmental monitoring, current approaches struggle to combine data sources with heterogeneous formats or content. A central difficulty arises when combining continuous gridded data (e.g., remote sensing imagery) with sparse, irregular point observations such as species records. Existing geostatistical and deep-learning-based approaches typically operate on a single modality or assume spatially aligned inputs, and therefore do not readily handle this combination. We propose the Geolocation-Aware MultiModal Approach (GAMMA), a transformer-based fusion method designed to integrate heterogeneous ecological data using explicit spatial context. Instead of interpolating observations onto a common grid, GAMMA first represents all inputs as location-aware embeddings that preserve spatial relationships between samples. It then dynamically selects relevant neighbours across modalities and spatial scales, enabling the model to jointly exploit continuous remote sensing imagery and sparse geolocated observations. We evaluate GAMMA on the task of predicting 103 environmental variables from the SWECO25 data cube across Switzerland. Inputs combine aerial imagery with biodiversity observations from GBIF and textual habitat descriptions from Wikipedia, provided by the EcoWikiRS dataset. Experiments show that multimodal fusion consistently improves prediction performance over single-modality baselines and that explicit spatial context further enhances model accuracy. The flexible architecture of GAMMA also allows the contribution of each modality to be analysed through controlled ablation experiments. These results demonstrate the potential of location-aware multimodal learning for integrating heterogeneous ecological data and for supporting large-scale environmental mapping and biodiversity monitoring.
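The abstract does not specify how locations are embedded or how neighbours are selected, so the following is only a minimal sketch of the general idea under stated assumptions: a sinusoidal coordinate encoding, k-nearest-neighbour selection as a simple stand-in for the learned dynamic selection, and single-head dot-product attention without learned weights. The function names `encode_location` and `fuse_neighbours`, the coordinate units, and the feature dimension are hypothetical and are not taken from GAMMA.

```python
import numpy as np


def encode_location(coords, dim=16, max_scale=1000.0):
    """Sinusoidal encoding of 2-D coordinates (assumed scheme; dim divisible by 4)."""
    coords = np.asarray(coords, dtype=float)                  # (n, 2)
    n_freq = dim // 4                                         # sin + cos for x and y
    freqs = 1.0 / (max_scale ** (np.arange(n_freq) / n_freq))
    angles = coords[:, :, None] * freqs                       # (n, 2, n_freq)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(len(coords), -1)                       # (n, dim)


def fuse_neighbours(query_xy, query_feat, tokens_xy, tokens_feat, k=4):
    """Attend from one query location to its k nearest tokens of any modality."""
    # Neighbour selection: plain Euclidean k-NN (an assumption, not the paper's rule).
    dist = np.linalg.norm(tokens_xy - query_xy, axis=1)
    idx = np.argsort(dist)[:k]

    # Location-aware embeddings: feature vector plus coordinate encoding.
    dim = query_feat.shape[0]
    q = query_feat + encode_location(query_xy[None, :], dim=dim)[0]
    kv = tokens_feat[idx] + encode_location(tokens_xy[idx], dim=dim)

    # Scaled dot-product attention with a numerically stable softmax.
    scores = kv @ q / np.sqrt(dim)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ kv, idx, weights


# Toy data: 6 gridded image patches and 9 sparse point records (all synthetic).
rng = np.random.default_rng(0)
image_xy = rng.uniform(0, 100, size=(6, 2))
species_xy = rng.uniform(0, 100, size=(9, 2))
tokens_xy = np.vstack([image_xy, species_xy])
tokens_feat = rng.normal(size=(15, 16))

query_xy = np.array([50.0, 50.0])
query_feat = rng.normal(size=16)
fused, idx, weights = fuse_neighbours(query_xy, query_feat, tokens_xy, tokens_feat, k=4)
```

Because image patches and point records are stacked into one token set before selection, no interpolation onto a shared grid is needed; a real model would replace the fixed k-NN step and the unweighted attention with learned components.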