Variable importance produced by Random Forests (RF) is widely used in statistical data analysis and plays an important role in tasks such as model interpretation, model selection and diagnosis, and cost-bounded learning. However, the calculation of variable importance in RF does not take into account the correlations among variables: a variable that is correlated with many other variables tends to receive a lower importance index, or to be completely masked (i.e., assigned an importance index near zero) by other strongly correlated variables. To keep correlated variables from distorting the calculation of variable importance, we propose to group variables by their conditional correlations (conditional on the response variable). We explore two computationally efficient options: one groups variables one at a time, separating the variable of interest from all variables correlated with it, while the other uses clustering to group variables according to their pairwise conditional correlations. Our experiments show that both lead to sensible corrections to the importance of variables.
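As an illustration of the grouping idea, the following is a minimal sketch, not the procedure used in the paper. It assumes a categorical response, estimates the conditional correlation of two variables as the class-weighted average of their within-class Pearson correlations, and groups variables by linking any pair whose absolute conditional correlation reaches a threshold. The function names, the estimator, the threshold value 0.7, and the synthetic data are all assumptions made for this example.

```python
import math
import random
from collections import defaultdict


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def conditional_correlation(X, y):
    """Class-weighted average of within-class correlations (an assumed estimator).

    X is a list of rows, y a list of class labels; returns a p x p matrix.
    """
    n, p = len(X), len(X[0])
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    C = [[0.0] * p for _ in range(p)]
    for rows in rows_by_class.values():
        weight = len(rows) / n
        cols = list(zip(*rows))  # column-wise view of this class's rows
        for i in range(p):
            for j in range(p):
                C[i][j] += weight * pearson(cols[i], cols[j])
    return C


def group_variables(C, threshold=0.7):
    """Group variables into connected components of the graph |C[i][j]| >= threshold."""
    p = len(C)
    parent = list(range(p))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(p):
        for j in range(i + 1, p):
            if abs(C[i][j]) >= threshold:
                parent[find(i)] = find(j)
    groups = defaultdict(list)
    for i in range(p):
        groups[find(i)].append(i)
    return sorted(groups.values())


# Synthetic data (hypothetical): x0 and x2 both depend on the class label,
# x1 is a noisy copy of x0, and x2 is independent of x0 within each class.
random.seed(0)
X, y = [], []
for k in range(200):
    label = k % 2
    x0 = 3 * label + random.gauss(0, 1)
    x1 = x0 + 0.1 * random.gauss(0, 1)
    x2 = 3 * label + random.gauss(0, 1)
    X.append([x0, x1, x2])
    y.append(label)

C = conditional_correlation(X, y)
groups = group_variables(C)
print(groups)
```

In this sketch, x0 and x2 are marginally correlated only through the response, so conditioning on the response keeps them in separate groups, while x0 and its noisy copy x1 fall into the same group. Variable importance could then be computed with each variable kept apart from the other members of its group; that step is not shown here.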