In many studies, we want to determine the influence of certain features on a dependent variable. More specifically, we are interested in the strength of the influence (i.e., is the feature relevant?) and, if it is relevant, in how the feature influences the dependent variable. Recently, data-driven approaches such as \emph{random forest regression} have found their way into applications (Boulesteix et al., 2012). These models allow measures of feature importance to be derived directly, and such measures are a natural indicator of the strength of the influence. For the relevant features, the correlation or rank correlation between the feature and the dependent variable has typically been used to determine the nature of the influence. More recent methods, some of which can also measure interactions between features, are based on a modeling approach. In particular, when machine learning models are used, SHAP scores are a recent and prominent method for determining these trends (Lundberg et al., 2017). In this paper, we introduce a novel notion of feature importance based on the well-studied Gram-Schmidt decorrelation method. Furthermore, we propose two estimators for identifying trends in the data using random forest regression, the so-called absolute and relative transversal rates. We empirically compare the properties of our estimators with those of well-established estimators on a variety of synthetic and real-world datasets.
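As a rough illustration of the Gram-Schmidt decorrelation idea mentioned above, the following minimal sketch centers each feature column and removes from it the components explained by the preceding columns, so that the resulting columns have zero sample correlation. It is not the feature importance measure or the transversal rate estimators proposed in the paper; the function names (`center`, `dot`, `gram_schmidt_decorrelate`, `pearson`) and the toy data are assumptions made purely for illustration.

```python
def center(col):
    """Subtract the sample mean from a column."""
    m = sum(col) / len(col)
    return [x - m for x in col]


def dot(u, v):
    """Inner product of two equally long sequences."""
    return sum(a * b for a, b in zip(u, v))


def gram_schmidt_decorrelate(columns):
    """Return centered columns, each orthogonal to all earlier ones.

    Orthogonality of centered columns is equivalent to zero sample
    correlation, so the output columns are pairwise uncorrelated.
    """
    result = []
    for col in columns:
        v = center(col)
        for u in result:
            denom = dot(u, u)
            if denom > 0.0:  # skip columns that became (numerically) zero
                coef = dot(v, u) / denom
                v = [a - coef * b for a, b in zip(v, u)]
        result.append(v)
    return result


def pearson(u, v):
    """Sample Pearson correlation; returns 0.0 for a constant column."""
    cu, cv = center(u), center(v)
    denom = (dot(cu, cu) * dot(cv, cv)) ** 0.5
    return dot(cu, cv) / denom if denom > 0.0 else 0.0


# Hypothetical toy data: two strongly correlated features.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.1, 1.9, 3.2, 3.8, 5.3]
z1, z2 = gram_schmidt_decorrelate([x1, x2])
print(pearson(x1, x2), pearson(z1, z2))
```

Note that the result depends on the order of the columns: the first feature is only centered, while each later feature keeps just the part that the earlier ones do not explain linearly.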