While machine learning (ML) models are increasingly used due to their high predictive power, their use in understanding the data-generating process (DGP) is limited. Understanding the DGP requires insights into feature-target associations, which many ML models cannot directly provide due to their opaque internal mechanisms. Feature importance (FI) methods provide useful insights into the DGP under certain conditions. Since the results of different FI methods have different interpretations, selecting the correct FI method for a concrete use case is crucial and still requires expert knowledge. This paper serves as a comprehensive guide to help understand the different interpretations of FI methods. By extensively reviewing FI methods and providing new proofs regarding their interpretation, we facilitate a thorough understanding of these methods and formulate concrete recommendations for scientific inference. We conclude by discussing options for FI uncertainty estimation and point to directions for future research aiming at full statistical inference from black-box ML models.