Originating from cooperative game theory, Shapley values have become one of the most widely used measures of variable importance in applied machine learning. However, the statistical understanding of Shapley values is still limited. In this paper, we take a nonparametric (or smoothing) perspective by introducing Shapley curves as a local measure of variable importance. We propose two estimation strategies and derive consistency and asymptotic normality under both independence and dependence among the features. This allows us to construct confidence intervals and conduct inference on the estimated Shapley curves. We propose a novel version of the wild bootstrap procedure, specifically adjusted to give good finite-sample coverage of the Shapley curves. The asymptotic results are validated in extensive experiments. In an empirical application, we analyze which attributes drive the prices of vehicles.
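To make the idea of a local, point-wise importance measure concrete, the following minimal sketch evaluates a population-level Shapley decomposition at a single point x. It assumes the standard Shapley weighting over feature subsets and takes the value of a subset S to be the conditional mean E[Y | X_S = x_S]. The names `shapley_curve`, `cond_mean`, and `toy_cond_mean`, as well as the toy model itself, are hypothetical and chosen only for illustration. This is neither of the two estimation strategies nor the bootstrap procedure proposed here: the conditional means are taken as known rather than estimated.

```python
from itertools import combinations
from math import factorial


def shapley_curve(j, x, cond_mean):
    """Shapley-weighted contribution of feature j at the point x.

    cond_mean(S, x) is a hypothetical callable returning E[Y | X_S = x_S]
    for a sorted tuple S of feature indices; S = () gives E[Y].
    """
    d = len(x)
    others = [k for k in range(d) if k != j]
    total = 0.0
    for size in range(d):
        # Standard Shapley weight |S|! (d - |S| - 1)! / d!
        weight = factorial(size) * factorial(d - size - 1) / factorial(d)
        for S in combinations(others, size):
            with_j = tuple(sorted(S + (j,)))
            total += weight * (cond_mean(with_j, x) - cond_mean(S, x))
    return total


def toy_cond_mean(S, x):
    # Assumed toy model: Y = X0 + 2*X1 + X0*X1 + noise, where X0 and X1 are
    # independent with mean zero. Under that assumption, conditioning on S
    # keeps the main effects of features in S, and the interaction term
    # survives only when both features are in S.
    value = 0.0
    if 0 in S:
        value += x[0]
    if 1 in S:
        value += 2.0 * x[1]
    if 0 in S and 1 in S:
        value += x[0] * x[1]
    return value


x = (1.0, 3.0)
phi = [shapley_curve(j, x, toy_cond_mean) for j in range(len(x))]
print(phi)
```

In this toy setting the two contributions sum to the full conditional mean at x minus the unconditional mean, which is the usual additivity property of Shapley values. Evaluating `shapley_curve` over a grid of x values traces out a curve for each feature.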