Variable importance assessment has become a crucial step in machine-learning applications when using complex learners, such as deep neural networks, on large-scale data. Removal-based importance assessment is currently the reference approach, particularly when statistical guarantees are sought to justify variable inclusion. It is often implemented with variable permutation schemes. However, these approaches risk misidentifying unimportant variables as important when covariates are correlated. Here we develop a systematic approach for studying Conditional Permutation Importance (CPI) that is model-agnostic and computationally lean, and we provide reusable benchmarks of state-of-the-art variable importance estimators. We show theoretically and empirically that $\textit{CPI}$ overcomes the limitations of standard permutation importance by providing accurate type-I error control. When used with a deep neural network, $\textit{CPI}$ consistently showed top accuracy across benchmarks. An empirical benchmark on a large-scale, real-world medical dataset showed that $\textit{CPI}$ provides a more parsimonious selection of statistically significant variables. Our results suggest that $\textit{CPI}$ can be readily used as a drop-in replacement for permutation-based methods.
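To make the contrast between standard and conditional permutation concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes two correlated covariates, a linear model for the conditional expectation of one covariate given the other, and a contrived `toy_model` that is accurate on the observed data but behaves erratically away from it. All function names, the data-generating process, and `toy_model` are hypothetical choices made for illustration. The sketch covers only the perturbation step and the resulting loss increase; it does not include the statistical test that would be needed for type-I error control.

```python
import random


def mean(v):
    return sum(v) / len(v)


def pearson(u, v):
    """Sample Pearson correlation between two equal-length lists."""
    mu, mv = mean(u), mean(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5


def fit_line(x, y):
    """Least-squares intercept and slope for y ~ a + b * x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b


def marginal_permute(xj, rng):
    """Standard permutation: shuffle xj, ignoring the other covariates."""
    out = list(xj)
    rng.shuffle(out)
    return out


def conditional_permute(xj, x_other, rng):
    """Shuffle only the part of xj not explained by x_other.

    Assumption: the conditional expectation of xj is linear in x_other.
    """
    a, b = fit_line(x_other, xj)
    fitted = [a + b * xo for xo in x_other]
    residuals = [v - f for v, f in zip(xj, fitted)]
    rng.shuffle(residuals)
    return [f + r for f, r in zip(fitted, residuals)]


def mse(predict, x1, x2, y):
    """Mean squared error of predict(x1, x2) against y."""
    return mean([(predict(a, b) - t) ** 2 for a, b, t in zip(x1, x2, y)])


def toy_model(a, b):
    """Hypothetical learner: accurate on the data, erratic far from it."""
    return a + 2.0 * max(0.0, abs(b - 0.8 * a) - 2.0)


rng = random.Random(0)
n = 2000
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [0.8 * a + rng.gauss(0, 0.6) for a in x1]  # correlated with x1
y = [a + rng.gauss(0, 0.5) for a in x1]         # depends on x1 only

x2_marg = marginal_permute(x2, rng)
x2_cond = conditional_permute(x2, x1, rng)

# Importance of x2 = increase in loss after perturbing x2.
base = mse(toy_model, x1, x2, y)
imp_marg = mse(toy_model, x1, x2_marg, y) - base
imp_cond = mse(toy_model, x1, x2_cond, y) - base
corr_marg = pearson(x1, x2_marg)
corr_cond = pearson(x1, x2_cond)

print(f"corr(x1, x2): marginal {corr_marg:.2f}, conditional {corr_cond:.2f}")
print(f"importance of x2: marginal {imp_marg:.3f}, conditional {imp_cond:.3f}")
```

In this toy setting, the marginal shuffle destroys the correlation between the two covariates and pushes the model into regions it never saw, so the unimportant covariate appears to matter. The conditional shuffle keeps the covariates' joint structure and leaves the loss essentially unchanged. A more flexible regressor could replace `fit_line` when the dependence between covariates is not linear.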