Mathematical Theory of Collinearity Effects on Machine Learning Variable Importance Measures

In many machine learning problems, understanding variable importance is a central concern. Two common approaches are Permute-and-Predict (PaP), which randomly permutes a feature in a validation set, and Leave-One-Covariate-Out (LOCO), which retrains models after permuting a training feature. Both methods deem a variable important if predictions with the original data substantially outperform those with permutations. In linear regression, empirical studies have linked PaP to regression coefficients and LOCO to $t$-statistics, but a formal theory has been lacking. We derive closed-form expressions for both measures, expressed using square-root transformations. PaP is shown to be proportional to the coefficient and predictor variability: $\text{PaP}_i = \beta_i \sqrt{2\operatorname{Var}(\mathbf{x}^v_i)}$, while LOCO is proportional to the coefficient but dampened by collinearity (captured by $\Delta$): $\text{LOCO}_i = \beta_i (1 -\Delta)\sqrt{1 + c}$. These derivations explain why PaP is largely unaffected by multicollinearity, whereas LOCO is highly sensitive to it. Monte Carlo simulations confirm these findings across varying levels of collinearity. Although derived for linear regression, we also show that these results provide reasonable approximations for models like Random Forests. Overall, this work establishes a theoretical basis for two widely used importance measures, helping analysts understand how they are affected by the true coefficients, dimension, and covariance structure. This work bridges empirical evidence and theory, enhancing the interpretability and application of variable importance measures.
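
The PaP expression can be checked numerically with a small simulation. The sketch below rests on assumptions not stated in the abstract: two Gaussian predictors with correlation `rho`, unit-variance noise, an ordinary least-squares fit without intercept, and PaP taken as the square root of the increase in validation mean squared error after permuting one column. The function name `simulate_pap` and all parameter values are hypothetical, and the fitted coefficient stands in for $\beta_i$, with its absolute value used because a square root is non-negative. Only PaP is covered, since $\Delta$ and $c$ in the LOCO expression are not defined here.

```python
import numpy as np


def simulate_pap(rho, n=20000, beta=(1.0, 2.0), seed=0):
    """Compare empirical PaP with |beta_i| * sqrt(2 * Var(x_i)) for two predictors.

    Assumed setup (not from the paper): bivariate Gaussian predictors with
    correlation `rho`, unit-variance Gaussian noise, OLS without intercept.
    Returns a list of (empirical, closed_form) pairs, one per predictor.
    """
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    beta = np.asarray(beta, dtype=float)

    # Independent training and validation samples from the same distribution.
    X_train = rng.multivariate_normal(np.zeros(2), cov, size=n)
    X_val = rng.multivariate_normal(np.zeros(2), cov, size=n)
    y_train = X_train @ beta + rng.normal(size=n)
    y_val = X_val @ beta + rng.normal(size=n)

    # Fit once on the training data; PaP never retrains the model.
    beta_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    base_mse = np.mean((y_val - X_val @ beta_hat) ** 2)

    results = []
    for i in range(X_val.shape[1]):
        # Permute only column i of the validation set.
        X_perm = X_val.copy()
        X_perm[:, i] = rng.permutation(X_val[:, i])
        perm_mse = np.mean((y_val - X_perm @ beta_hat) ** 2)

        # Square-root scale; clip at zero in case sampling noise makes the gap negative.
        empirical = np.sqrt(max(perm_mse - base_mse, 0.0))
        closed_form = abs(beta_hat[i]) * np.sqrt(2.0 * np.var(X_val[:, i]))
        results.append((empirical, closed_form))
    return results


if __name__ == "__main__":
    for rho in (0.0, 0.5, 0.9):
        for i, (emp, closed) in enumerate(simulate_pap(rho)):
            print(f"rho={rho:.1f} x{i}: empirical={emp:.3f} closed_form={closed:.3f}")
```

Under these assumptions the two columns of output should stay close to each other at every value of `rho`, which is the sense in which PaP is largely unaffected by collinearity: permuting a validation column changes the prediction by the coefficient times the difference between the original and the permuted values, and that difference has variance of roughly $2\operatorname{Var}(\mathbf{x}^v_i)$ whatever the correlation with the other predictor.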
