Feature-importance methods show promise in transforming machine learning models from predictive engines into tools for scientific discovery. However, due to data sampling and algorithmic stochasticity, expressive models can be unstable, leading to inaccurate variable importance estimates and undermining their utility in critical biomedical applications. Although ensembling offers a solution, deciding whether to explain a single ensemble model or aggregate individual model explanations is difficult due to the nonlinearity of importance measures and remains largely understudied. Our theoretical analysis, developed under assumptions accommodating complex state-of-the-art ML models, reveals that this choice is primarily driven by the model's excess risk. In contrast to prior literature, we show that ensembling at the model level provides more accurate variable-importance estimates, particularly for expressive models, by reducing this leading error term. We validate these findings on classical benchmarks and a large-scale proteomic study from the UK Biobank.
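The two aggregation choices discussed above can be made concrete with a minimal sketch. This is an illustrative toy (bootstrapped least-squares base learners and permutation importance as the feature-importance measure are assumptions for illustration, not the paper's actual models or estimator): Option A explains the single ensemble by averaging predictions first, while Option B averages the per-model importance estimates. Because the importance measure is nonlinear in the prediction function, the two generally differ.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 5
X = rng.normal(size=(n, d))
beta = np.array([2.0, 1.0, 0.5, 0.0, 0.0])  # last two features are noise
y = X @ beta + rng.normal(scale=0.5, size=n)

def fit_ls(X, y):
    # Least-squares base learner (stand-in for an expressive model).
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def perm_importance(predict, X, y, rng, n_repeats=5):
    # Permutation importance: MSE increase when a feature is shuffled.
    base = np.mean((y - predict(X)) ** 2)
    imp = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])
            imp[j] += np.mean((y - predict(Xp)) ** 2) - base
    return imp / n_repeats

# Train M models on bootstrap resamples (the sampling instability
# the abstract refers to).
M = 20
models = [fit_ls(X[idx], y[idx])
          for idx in (rng.integers(0, n, size=n) for _ in range(M))]

# Option A: explain the single ensemble model (average predictions first).
ensemble_pred = lambda Z: np.mean([Z @ w for w in models], axis=0)
imp_ensemble = perm_importance(ensemble_pred, X, y, rng)

# Option B: aggregate individual explanations (average importances last).
imp_avg = np.mean(
    [perm_importance(lambda Z, w=w: Z @ w, X, y, rng) for w in models],
    axis=0,
)
```

On this toy problem both options rank the informative features above the noise features; the paper's contribution concerns which option is more accurate for expressive models, which this sketch does not adjudicate.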