Conditional Feature Importance for Mixed Data

Despite the popularity of feature importance (FI) measures in interpretable machine learning, the statistical adequacy of these methods is rarely discussed. From a statistical perspective, a major distinction is between analyzing a variable's importance before and after adjusting for covariates, i.e., between $\textit{marginal}$ and $\textit{conditional}$ measures. Our work draws attention to this rarely acknowledged yet crucial distinction and showcases its implications. Further, we reveal that only a few methods are available for testing conditional FI, and that practitioners have so far been severely restricted in applying them because of mismatched data requirements. Most real-world data exhibit complex feature dependencies and incorporate both continuous and categorical features (mixed data). Both properties are often neglected by conditional FI measures. To fill this gap, we propose combining the conditional predictive impact (CPI) framework with sequential knockoff sampling. The CPI enables conditional FI measurement that controls for any feature dependencies by sampling valid knockoffs, that is, synthetic data with similar statistical properties, for the data to be analyzed. Sequential knockoffs were deliberately designed to handle mixed data and thus allow us to extend the CPI approach to such datasets. We demonstrate through numerous simulations and a real-world example that our proposed workflow controls type I error, achieves high power, and is in line with results given by other conditional FI measures, whereas marginal FI metrics result in misleading interpretations. Our findings highlight the necessity of developing statistically adequate, specialized methods for mixed data.
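
The CPI idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes continuous features only, uses ordinary least squares both as the predictive model and inside a simplified Gaussian sequential sampler (the sequential knockoffs named in the abstract also cover categorical features, which this sketch does not), and approximates the one-sided test with a normal rather than a t distribution. All function names (`ols_fit`, `ols_predict`, `gaussian_sequential_knockoffs`, `cpi`) and the simulated data are hypothetical.

```python
import math
import numpy as np


def ols_fit(X, y):
    """Least-squares coefficients, with an intercept column prepended."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta


def ols_predict(beta, X):
    return np.column_stack([np.ones(len(X)), X]) @ beta


def gaussian_sequential_knockoffs(X, rng):
    """Simplified sequential sampler (assumption: continuous features only).

    Feature j is regressed on all other features plus the knockoffs already
    drawn for features 0..j-1; its knockoff is sampled from a Gaussian
    centred on the fitted values with the residual standard deviation.
    """
    n, p = X.shape
    X_tilde = np.empty_like(X)
    for j in range(p):
        Z = np.column_stack([np.delete(X, j, axis=1), X_tilde[:, :j]])
        mu = ols_predict(ols_fit(Z, X[:, j]), Z)
        sigma = np.std(X[:, j] - mu, ddof=Z.shape[1] + 1)
        X_tilde[:, j] = mu + sigma * rng.standard_normal(n)
    return X_tilde


def cpi(X_train, y_train, X_test, y_test, rng):
    """Per-feature (mean loss increase, test statistic, one-sided p-value)."""
    beta = ols_fit(X_train, y_train)
    X_tilde = gaussian_sequential_knockoffs(X_test, rng)
    base_loss = (y_test - ols_predict(beta, X_test)) ** 2
    out = []
    for j in range(X_test.shape[1]):
        X_swap = X_test.copy()
        X_swap[:, j] = X_tilde[:, j]  # replace feature j by its knockoff
        delta = (y_test - ols_predict(beta, X_swap)) ** 2 - base_loss
        stat = delta.mean() / (delta.std(ddof=1) / math.sqrt(len(delta)))
        p_value = 0.5 * math.erfc(stat / math.sqrt(2))  # normal approximation
        out.append((delta.mean(), stat, p_value))
    return out


# Simulated example: x2 is correlated with x1 but has no effect on y
# once x1 is accounted for; x3 is independent and has a direct effect.
rng = np.random.default_rng(0)
n = 2000
x1 = rng.standard_normal(n)
x2 = x1 + 0.5 * rng.standard_normal(n)
x3 = rng.standard_normal(n)
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 + x3 + rng.standard_normal(n)

half = n // 2
results = cpi(X[:half], y[:half], X[half:], y[half:], rng)
for name, (mean_delta, stat, p_value) in zip(["x1", "x2", "x3"], results):
    print(f"{name}: CPI={mean_delta:.3f}  stat={stat:.2f}  p={p_value:.3g}")
```

In this setup x2 is marginally associated with y through x1, so a marginal measure would tend to flag it, whereas the knockoff-based loss difference for x2 should stay close to zero because its knockoff preserves the dependence on x1.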
