The aim of this study is to define the importance of predictors for black-box machine learning methods, where the prediction function can be complex and cannot be represented by statistical parameters. In this paper we define a ``Generalized Variable Importance Metric (GVIM)'' using the true conditional expectation function for a continuous or a binary response variable. We further show that the GVIM can be represented as a function of the Conditional Average Treatment Effect (CATE) for multinomial and continuous predictors. We then propose how the metric can be estimated using any machine learning model. Finally, using simulations, we evaluate the properties of the estimator when it is estimated from XGBoost, Random Forest, and a misspecified generalized additive model.