Multimodal aerial data are used to monitor natural systems, and machine learning can significantly accelerate the classification of landscape features in such imagery, benefiting ecology and conservation. How these modalities should be fused in a deep learning model, however, remains under-explored. As a step towards filling this gap, we study three strategies (Early fusion, Late fusion, and Mixture of Experts) for fusing thermal, RGB, and LiDAR imagery, using a dataset of spatially aligned orthomosaics in these three modalities. In particular, we aim to map three ecologically relevant biophysical landscape features in African savanna ecosystems: rhino middens, termite mounds, and water. The strategies differ in whether the modalities are fused early or late and, if late, whether the model learns fixed per-class weights for each modality or generates those weights adaptively from the input. Overall, the three methods have similar macro-averaged performance, with Late fusion achieving an AUC of 0.698, but their per-class performance varies strongly: Early fusion achieves the best recall for middens and water, and Mixture of Experts achieves the best recall for mounds.
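The distinction between the three strategies can be made concrete with a small sketch. The code below is illustrative only and is not the architecture used in this work: the function names, the channel counts, the softmax normalisation of weights across modalities, and the toy gate are all assumptions introduced here. It shows Early fusion as channel concatenation before a single model, and Late fusion and Mixture of Experts as weighted sums of per-modality class logits, where the weights are either fixed parameters or computed from the input.

```python
import math

MODALITIES = ["thermal", "rgb", "lidar"]
CLASSES = ["midden", "mound", "water"]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def early_fusion_input(channels_by_modality):
    """Early fusion: concatenate the channels of all modalities
    into one input that a single model would consume."""
    fused = []
    for name in MODALITIES:
        fused.extend(channels_by_modality[name])
    return fused


def weighted_fusion(logits, scores):
    """Combine per-modality class logits with per-class modality weights.

    logits[m][c]: logit for class c from the model of modality m.
    scores[m][c]: unnormalised weight of modality m for class c.
    Assumption: weights are normalised by softmax across modalities.
    """
    fused = []
    for c in range(len(CLASSES)):
        weights = softmax([scores[m][c] for m in range(len(MODALITIES))])
        fused.append(sum(w * logits[m][c] for m, w in enumerate(weights)))
    return fused


def late_fusion(logits, learned_scores):
    """Late fusion: the scores are learned parameters, the same for every input."""
    return weighted_fusion(logits, learned_scores)


def mixture_of_experts(logits, gate, features):
    """Mixture of Experts: a gate computes the scores from the input itself."""
    return weighted_fusion(logits, gate(features))


def toy_gate(features):
    """Hypothetical gate: one feature per modality, reused as that
    modality's score for every class."""
    return [[f] * len(CLASSES) for f in features]


# Toy logits: each modality is confident about exactly one class.
example_logits = [
    [3.0, 0.0, 0.0],  # thermal
    [0.0, 3.0, 0.0],  # rgb
    [0.0, 0.0, 3.0],  # lidar
]
uniform_scores = [[0.0] * len(CLASSES) for _ in MODALITIES]

late_out = late_fusion(example_logits, uniform_scores)
moe_out = mixture_of_experts(example_logits, toy_gate, [10.0, 0.0, 0.0])
```

With uniform scores, Late fusion reduces to averaging the three modalities for every input. The toy gate instead shifts almost all weight to the thermal model for this particular input, which is the kind of input-dependent behaviour that separates Mixture of Experts from fixed per-class weights.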