Spatial Context Improves the Integration of Text with Remote Sensing for Mapping Environmental Variables

Recent developments in natural language processing highlight text as an emerging data source for ecology. Textual resources carry unique information that can complement geospatial data sources, providing local-scale insights into environmental conditions and properties hidden from more traditional data sources. Leveraging textual information in a spatial context presents several challenges. First, the contribution of textual data remains poorly defined in an ecological context, and it is unclear for which tasks it should be incorporated. Second, unlike ubiquitous satellite imagery or environmental covariates, textual data is sparsely and irregularly available, so its integration with geospatial data is not straightforward. In response to these challenges, this work proposes an attention-based approach that combines aerial imagery and geolocated text within a spatial neighbourhood, i.e. it integrates contributions from several nearby observations. Our approach combines vision and text representations with a geolocation encoding and uses an attention-based module to dynamically select the spatial neighbours that are useful for the predictive task. The proposed approach is applied to the EcoWikiRS dataset, which combines high-resolution aerial imagery with sentences extracted from Wikipedia describing local environmental conditions across Switzerland. Our model is evaluated on the task of predicting 103 environmental variables from the SWECO25 data cube. Our approach consistently outperforms single-location baselines and unimodal (image-only or text-only) baselines. When analysing variables by thematic groups, results show a significant improvement in performance for climatic, edaphic, population and land use/land cover variables, underscoring the benefit of including the spatial context when combining text and image data.
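
To make the idea of attending over spatial neighbours concrete, the following is a minimal NumPy sketch and not the architecture used in this work. It assumes precomputed image and text embeddings of equal dimension, offsets to neighbours given in kilometres, a sinusoidal offset encoding, and a hard distance cut-off. The function names (`sinusoidal_offset_encoding`, `neighbour_attention`), the `radius_km` parameter and the additive fusion are all hypothetical choices made for illustration, and learned projection matrices are omitted for brevity.

```python
import numpy as np


def sinusoidal_offset_encoding(offsets_km, dim):
    """Encode 2-D offsets to neighbours as sin/cos features of size `dim`.

    Assumption: `dim` is divisible by 4 (two axes, sin and cos each).
    """
    n_freq = dim // 4
    freqs = 1.0 / (1000.0 ** (np.arange(n_freq) / n_freq))      # (n_freq,)
    angles = offsets_km[:, :, None] * freqs                     # (K, 2, n_freq)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], -1)  # (K, 2, 2*n_freq)
    return enc.reshape(offsets_km.shape[0], -1)                 # (K, dim)


def neighbour_attention(query, img_emb, txt_emb, offsets_km, has_text,
                        radius_km=1.0):
    """Weight K neighbouring observations by scaled dot-product attention.

    query:      (d,)   embedding of the target location
    img_emb:    (K, d) image embeddings of the neighbours
    txt_emb:    (K, d) text embeddings (ignored where has_text is False)
    offsets_km: (K, 2) offsets from the target location
    has_text:   (K,)   boolean, text is sparse so many entries may be False
    """
    d = query.shape[0]
    # Hypothetical fusion: add text to the image embedding where text exists.
    fused = img_emb + np.where(has_text[:, None], txt_emb, 0.0)
    keys = fused + sinusoidal_offset_encoding(offsets_km, d)

    scores = keys @ query / np.sqrt(d)
    # Exclude neighbours outside the spatial neighbourhood.
    in_radius = np.linalg.norm(offsets_km, axis=1) <= radius_km
    if not in_radius.any():
        raise ValueError("no neighbour lies within radius_km")
    scores = np.where(in_radius, scores, -np.inf)

    # Numerically stable softmax over the remaining neighbours.
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()
    context = weights @ fused                                   # (d,)
    return context, weights


# Toy usage with random embeddings (illustrative values only).
rng = np.random.default_rng(0)
K, d = 5, 8
offsets = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, -0.25],
                    [0.4, 0.3], [5.0, 5.0]])
has_text = np.array([True, False, True, False, True])
context, weights = neighbour_attention(
    rng.normal(size=d), rng.normal(size=(K, d)), rng.normal(size=(K, d)),
    offsets, has_text)
```

In this sketch the last neighbour lies outside the 1 km radius and therefore receives zero weight, while the remaining weights sum to one. In a trained model the context vector would be passed to a prediction head, for example one output per environmental variable.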
