We study case influence in Lasso regression using Cook's distance, which measures the overall change in the fitted values when one observation is deleted. Unlike in ordinary least squares regression, the estimated Lasso coefficients have no closed form due to the nondifferentiability of the $\ell_1$ penalty, and neither does Cook's distance. To find the case-deleted Lasso solution without refitting the model, we approach it from the full-data solution by attaching a weight parameter, ranging from 1 to 0, to the case in question and generating a solution path indexed by this parameter. We show that, for a fixed penalty, the solution path is piecewise linear in a simple function of the weight parameter. The resulting case influence is a function of the penalty and the weight, and it reduces to Cook's distance when the weight reaches 0. As the penalty parameter changes, the set of selected variables changes, and the magnitude of Cook's distance for the same data point may vary with the selected subset. In addition, we introduce a case influence graph to visualize how the contribution of each data point changes with the penalty parameter. From the graph, we can identify influential points at different penalty levels and make modeling decisions accordingly. Moreover, we find that case influence graphs exhibit different patterns in the underfitting and overfitting regimes, which provides additional information for model selection.
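The case-weight idea above can be illustrated with a brute-force sketch (not the paper's piecewise-linear path algorithm): attach a weight $w$ to a single observation, re-solve the weighted Lasso as $w$ moves from 1 to 0, and measure the change in fitted values, which at $w = 0$ corresponds to full deletion as in Cook's distance. This sketch assumes scikit-learn's `Lasso`, whose `fit` accepts `sample_weight`; the data, the probed case index, and the penalty level are all hypothetical choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data for illustration (hypothetical choices, not from the paper).
rng = np.random.default_rng(0)
n, p = 60, 8
X = rng.standard_normal((n, p))
beta = np.array([2.0, -1.5, 1.0] + [0.0] * (p - 3))
y = X @ beta + rng.standard_normal(n)

lam = 0.1                              # fixed penalty (sklearn's `alpha`)
full = Lasso(alpha=lam).fit(X, y)
yhat = full.predict(X)                 # full-data fitted values

i = 7                                  # case whose influence we probe
influences = []
for w in [1.0, 0.75, 0.5, 0.25, 0.0]:
    sw = np.ones(n)
    sw[i] = w                          # downweight case i; w = 0 deletes it
    fit_w = Lasso(alpha=lam).fit(X, y, sample_weight=sw)
    yhat_w = fit_w.predict(X)
    # Squared change in fitted values; dividing by a scale estimate
    # (p * sigma^2, as in OLS Cook's distance) would give the usual scaling.
    influences.append(float(np.sum((yhat - yhat_w) ** 2)))

for w, v in zip([1.0, 0.75, 0.5, 0.25, 0.0], influences):
    print(f"w = {w:.2f}: squared change in fitted values = {v:.4f}")
```

At $w = 1$ the weighted fit coincides with the full-data fit, so the influence is zero; the value at $w = 0$ is the (unscaled) case-deletion influence. The paper's contribution is to trace this path in closed piecewise-linear form rather than refitting at each weight, as done here.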