Conditional Feature Importance for Mixed Data

Despite the popularity of feature importance (FI) measures in interpretable machine learning, the statistical adequacy of these methods is rarely discussed. From a statistical perspective, a major distinction is between analyzing a variable's importance before and after adjusting for covariates, i.e., between $\textit{marginal}$ and $\textit{conditional}$ measures. Our work draws attention to this rarely acknowledged yet crucial distinction and showcases its implications. Further, we reveal that only a few methods are available for testing conditional FI, and that practitioners have so far been severely restricted in applying them because of mismatched data requirements. Most real-world data exhibit complex feature dependencies and contain both continuous and categorical variables (mixed data). Conditional FI measures often neglect both properties. To fill this gap, we propose combining the conditional predictive impact (CPI) framework with sequential knockoff sampling. The CPI enables conditional FI measurement that controls for any feature dependencies by sampling valid knockoffs, that is, synthetic data with similar statistical properties, for the data to be analyzed. Sequential knockoffs were deliberately designed to handle mixed data and thus allow us to extend the CPI approach to such datasets. We demonstrate through numerous simulations and a real-world example that our proposed workflow controls type I error, achieves high power, and agrees with the results of other conditional FI measures, whereas marginal FI metrics lead to misleading interpretations. Our findings highlight the necessity of developing statistically adequate, specialized methods for mixed data.
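
To make the marginal versus conditional distinction concrete, the following is a minimal sketch in the spirit of the CPI, not the authors' implementation. It rests on several assumptions that are not part of the abstract: the data are purely continuous, the predictive model is ordinary least squares, and the replacement for a feature is drawn from a simple linear-Gaussian model of that feature given the others. That sampler is only a stand-in; it is neither a valid knockoff sampler nor sequential knockoffs, and it does not handle categorical features. All function names (`fit_ols`, `sample_conditional`, `cpi_t_statistic`) are hypothetical.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares fit with intercept; returns [intercept, coefficients...]."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta


def predict_ols(beta, X):
    return beta[0] + X @ beta[1:]


def sample_conditional(X_fit, X_new, j, rng):
    """Stand-in sampler (assumption): draw column j from a linear-Gaussian
    model of X_j given the remaining columns. Not a knockoff sampler."""
    others = [k for k in range(X_fit.shape[1]) if k != j]
    beta = fit_ols(X_fit[:, others], X_fit[:, j])
    resid = X_fit[:, j] - predict_ols(beta, X_fit[:, others])
    sigma = resid.std(ddof=len(others) + 1)
    mean_new = predict_ols(beta, X_new[:, others])
    return mean_new + rng.normal(0.0, sigma, size=len(X_new))


def cpi_t_statistic(X_train, y_train, X_test, y_test, j, rng):
    """Per-sample loss increase after replacing feature j by a conditional
    draw, summarized as a one-sample t-statistic (large values suggest
    that feature j matters given the other features)."""
    beta = fit_ols(X_train, y_train)
    loss_orig = (y_test - predict_ols(beta, X_test)) ** 2
    X_rep = X_test.copy()
    X_rep[:, j] = sample_conditional(X_train, X_test, j, rng)
    loss_rep = (y_test - predict_ols(beta, X_rep)) ** 2
    delta = loss_rep - loss_orig
    return delta.mean() / (delta.std(ddof=1) / np.sqrt(len(delta)))


# Simulated data: x2 is strongly correlated with x1, but only x1 drives y.
rng = np.random.default_rng(0)
n = 2000
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + np.sqrt(1 - 0.9 ** 2) * rng.normal(size=n)
y = 2.0 * x1 + rng.normal(size=n)
X = np.column_stack([x1, x2])
X_train, X_test = X[: n // 2], X[n // 2 :]
y_train, y_test = y[: n // 2], y[n // 2 :]

# Marginal view: x2 looks important because it is a proxy for x1.
marginal_corr_x2 = np.corrcoef(x2, y)[0, 1]

# Conditional view: test statistics after adjusting for the other feature.
t_x1 = cpi_t_statistic(X_train, y_train, X_test, y_test, 0, rng)
t_x2 = cpi_t_statistic(X_train, y_train, X_test, y_test, 1, rng)

print(f"marginal corr(x2, y): {marginal_corr_x2:.2f}")
print(f"conditional t-statistic, x1: {t_x1:.2f}")
print(f"conditional t-statistic, x2: {t_x2:.2f}")
```

In this simulated setting, `x2` is strongly correlated with the outcome marginally, yet its conditional statistic should be much smaller than that of `x1`, because it carries no information beyond what `x1` already provides. Extending the idea to mixed data would require a sampler that also covers categorical features, which is the role the abstract assigns to sequential knockoffs.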
