Explaining the decision process of machine learning algorithms is now crucial both for improving model performance and for human comprehension. This can be achieved by assessing the importance of individual variables, even for high-capacity non-linear methods such as Deep Neural Networks (DNNs). Although only removal-based approaches, such as Permutation Importance (PI), provide statistical validity, they return misleading results when variables are correlated. Conditional Permutation Importance (CPI) bypasses PI's limitations in such cases. However, in high-dimensional settings, where high correlations between the variables cancel their conditional importance, CPI and other methods yield unreliable results and incur prohibitive computational costs. Grouping variables, either statistically via clustering or using prior knowledge, recovers some statistical power and leads to better interpretations. In this work, we introduce BCPI (Block-Based Conditional Permutation Importance), a new generic framework for variable importance computation with statistical guarantees that handles both the single-variable and the group case. Furthermore, as handling groups with high cardinality (such as a set of observations of a given modality) is both time-consuming and resource-intensive, we also introduce a new stacking approach that extends the DNN architecture with sub-linear layers adapted to the group structure. We show that the resulting approach, extended with stacking, controls the type-I error even with highly correlated groups and achieves top accuracy across benchmarks. Finally, we perform a real-world data analysis on a large-scale medical dataset, where we aim to show that our results are consistent with the literature for biomarker prediction.
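To make the contrast between marginal permutation, conditional permutation, and grouping concrete, the following is a minimal sketch and not the authors' BCPI implementation. It assumes a linear least-squares model as a stand-in for the DNN, uses synthetic data, and omits the statistical test entirely; the names `fit_linear` and `group_importance` are hypothetical helpers introduced only for illustration.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with intercept; returns a predict function.
    Assumption: a linear model stands in for the DNN used in the paper."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Z: np.column_stack([np.ones(len(Z)), Z]) @ coef

def group_importance(X_train, y_train, X_test, y_test, group,
                     conditional, n_perm=50, seed=0):
    """Mean increase in test MSE after permuting the columns in `group`.
    conditional=False: shuffle the group's rows directly (PI-style).
    conditional=True: shuffle only the part of the group that the
    remaining variables cannot predict (CPI-style)."""
    rng = np.random.default_rng(seed)
    group = np.asarray(group)
    rest = np.setdiff1d(np.arange(X_train.shape[1]), group)
    predict = fit_linear(X_train, y_train)
    base_loss = np.mean((y_test - predict(X_test)) ** 2)
    if conditional:
        # Predict the group from the remaining variables, keep the residuals.
        predict_group = fit_linear(X_train[:, rest], X_train[:, group])
        expected = predict_group(X_test[:, rest])
        residual = X_test[:, group] - expected
    increases = []
    for _ in range(n_perm):
        perm = rng.permutation(len(X_test))
        X_perm = X_test.copy()
        if conditional:
            X_perm[:, group] = expected + residual[perm]
        else:
            X_perm[:, group] = X_test[perm][:, group]
        loss = np.mean((y_test - predict(X_perm)) ** 2)
        increases.append(loss - base_loss)
    return float(np.mean(increases))

# Synthetic data (assumption): x0 and x1 are strongly correlated,
# y depends on x0 and x2 only, so x1 is a null variable.
rng = np.random.default_rng(42)
n = 2000
x0 = rng.normal(size=n)
x1 = 0.9 * x0 + np.sqrt(1 - 0.9 ** 2) * rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([x0, x1, x2])
y = 2.0 * x0 + x2 + 0.5 * rng.normal(size=n)
X_tr, X_te, y_tr, y_te = X[:1000], X[1000:], y[:1000], y[1000:]

pi_x0 = group_importance(X_tr, y_tr, X_te, y_te, [0], conditional=False)
cpi_x0 = group_importance(X_tr, y_tr, X_te, y_te, [0], conditional=True)
cpi_x1 = group_importance(X_tr, y_tr, X_te, y_te, [1], conditional=True)
cpi_block = group_importance(X_tr, y_tr, X_te, y_te, [0, 1], conditional=True)
print(pi_x0, cpi_x0, cpi_x1, cpi_block)
```

In this toy setting, the conditional score of `x0` is smaller than its marginal score because most of `x0` can be predicted from `x1`, while treating `{x0, x1}` as a single block restores a larger score. This only illustrates the intuition behind grouping; it says nothing about the type-I error control or the stacking architecture described above.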