Machine learning (ML) is a tool for exploiting remote sensing data to monitor and implement the United Nations' Sustainable Development Goals (SDGs). In this paper, we report a meta-analysis of the performance of ML applied to remote sensing data for SDG monitoring. Specifically, we aim to 1) estimate the average performance; 2) determine the degree of heterogeneity between and within studies; and 3) assess how study features influence model performance. Following PRISMA guidelines, we searched multiple academic databases to identify potentially relevant studies. Three reviewers screened a random sample of 200 of these studies, yielding 86 trials within 20 studies, described by 14 study features. Overall accuracy was the most frequently reported performance metric. We analyzed it using a double arcsine transformation and a three-level random-effects model. The average overall accuracy of the best model was 0.90 [0.86, 0.92]. There was considerable heterogeneity in model performance, 64% of which was between studies. The only significant feature was the prevalence of the majority class, which explained 61% of the between-study heterogeneity. None of the other thirteen features added value to the model. The most important contributions of this paper are the following two insights. 1) Overall accuracy is the most popular performance metric, yet arguably the least insightful. Because it is sensitive to class imbalance, it needs to be normalized, which is far from common practice. 2) The field needs to standardize its reporting. Reporting the confusion matrix for independent test sets is the most important ingredient for between-study comparisons of ML classifiers. These findings underscore the need for robust and comparable evaluation metrics in machine learning applications to ensure reliable and actionable insights for effective SDG monitoring and policy formulation.
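To illustrate why overall accuracy is sensitive to class imbalance, and what a double arcsine transformation looks like, the following minimal Python sketch may help. It is not the authors' code. The confusion matrix is a made-up example, the function names are hypothetical, and balanced accuracy stands in as just one possible normalization, since the abstract does not say which one is meant. The transformation follows the common Freeman-Tukey form, scaled by one half, with approximate sampling variance 1 / (4n + 2); the paper may use a different variant.

```python
import math


def overall_accuracy(cm):
    """Share of correctly classified samples.

    cm[i][j] counts samples of true class i predicted as class j.
    """
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def majority_prevalence(cm):
    """Share of samples belonging to the largest true class."""
    total = sum(sum(row) for row in cm)
    return max(sum(row) for row in cm) / total


def balanced_accuracy(cm):
    """Mean per-class recall (an assumed, illustrative normalization)."""
    recalls = [cm[i][i] / sum(cm[i]) for i in range(len(cm)) if sum(cm[i]) > 0]
    return sum(recalls) / len(recalls)


def double_arcsine(correct, n):
    """Freeman-Tukey double arcsine transform of the proportion correct / n.

    Returns the transformed value and its approximate sampling variance.
    """
    t = 0.5 * (
        math.asin(math.sqrt(correct / (n + 1)))
        + math.asin(math.sqrt((correct + 1) / (n + 1)))
    )
    variance = 1.0 / (4 * n + 2)
    return t, variance


# Hypothetical imbalanced test set: 90 samples of class 0, 10 of class 1.
cm = [
    [85, 5],
    [8, 2],
]

print("overall accuracy:   ", overall_accuracy(cm))
print("majority prevalence:", majority_prevalence(cm))
print("balanced accuracy:  ", balanced_accuracy(cm))
print("double arcsine:     ", double_arcsine(85 + 2, 100))
```

In this made-up example the overall accuracy falls below the prevalence of the majority class, so a classifier that always predicts the majority class would score higher, while balanced accuracy shows how poorly the minority class is recovered. All of these quantities can be computed only when the confusion matrix is reported.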