Variable importance in regression analyses is of considerable interest in a variety of fields. There is no unique method for assessing variable importance. However, a substantial share of the available literature employs Shapley values, either explicitly or implicitly, to decompose a suitable goodness-of-fit measure, typically the classical $R^2$ in the linear regression model. Beyond linear regression, there is no generally accepted goodness-of-fit measure, only a variety of pseudo-$R^2$s. We formulate and discuss the desirable properties of goodness-of-fit measures that enable Shapley values to be interpreted in terms of relative, and even absolute, importance. We suggest using a pseudo-$R^2$ based on the Kullback-Leibler divergence, the Kullback-Leibler $R^2$, which has a convenient form for generalized linear models and allows us to unify and extend previous work on variable importance for linear and nonlinear models. Several examples are presented, using data from public health and insurance.
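To fix ideas, here is a minimal sketch of the two ingredients referred to above: the Shapley attribution of a goodness-of-fit measure $v$ to regressor $j$ among $p$ candidate regressors, and a divergence-ratio form of the Kullback-Leibler $R^2$. The notation ($v(S)$, $D_{\mathrm{KL}}$, $\hat\mu_i$) is illustrative and not taken verbatim from the paper.
\[
\phi_j \;=\; \sum_{S \subseteq \{1,\dots,p\}\setminus\{j\}} \frac{|S|!\,(p-|S|-1)!}{p!}\,\bigl[\,v(S \cup \{j\}) - v(S)\,\bigr],
\qquad
R^2_{\mathrm{KL}} \;=\; 1 - \frac{\sum_{i=1}^{n} D_{\mathrm{KL}}\bigl(y_i,\hat\mu_i\bigr)}{\sum_{i=1}^{n} D_{\mathrm{KL}}\bigl(y_i,\bar y\bigr)},
\]
where $v(S)$ denotes the goodness of fit of the model using only the regressors in $S$, and $\hat\mu_i$ and $\bar y$ are the fitted and null-model means for observation $i$. For generalized linear models the ratio can be read as residual over null deviance, so that $R^2_{\mathrm{KL}}$ reduces to the classical $R^2$ in the Gaussian linear model.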