This work applies Global Sensitivity Analysis to supervised machine learning methods such as Random Forests. These methods act as black boxes, selecting features in high-dimensional data sets so as to provide classifiers that predict accurately when new data are fed into the system. In supervised machine learning, predictors are generally ranked by importance based on their contribution to the final prediction. Global Sensitivity Analysis is primarily used in mathematical modelling to investigate how uncertainty in the input variables affects the output. We apply it here as a novel way to rank the input features by their importance to the explainability of the data-generating process, shedding light on how the response is determined by the dependence structure of its predictors. A simulation study shows that our proposal can be used to explore what advances can be achieved in terms of efficiency or explanatory ability, or simply by way of confirming existing results.