A data analyst might worry about generalization if dropping a very small fraction of data points from a study could change its substantive conclusions. Finding the worst-case data subset to drop poses a combinatorial optimization problem. To overcome this intractability, recent works propose using additive approximations, which treat the contribution of a collection of data points as the sum of their individual contributions, and greedy approximations, which iteratively select the point with the highest impact to drop and re-run the data analysis without that point [Broderick et al., 2020, Kuschnig et al., 2021]. We identify that, even in a setting as simple as OLS linear regression, many of these approximations can break down in realistic data arrangements. Several of our examples reflect masking, where one outlier may hide or conceal the effect of another outlier. Based on the failures we identify, we provide recommendations for users and suggest directions for future improvements.
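The two approximations above can be illustrated concretely. The following is a minimal NumPy sketch, not the authors' implementation: it fits an OLS slope, computes exact leave-one-out effects, forms the additive approximation for dropping a two-point subset (summing individual effects), applies the greedy strategy (repeatedly drop the single most impactful point and refit), and compares both against an exact refit. The helper names (`ols_slope`, `greedy_drop`) and the synthetic data are hypothetical.

```python
import numpy as np

def ols_slope(X, y):
    # hypothetical helper: slope from a least-squares fit with intercept
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[1]

def greedy_drop(x, y, k):
    # greedy approximation: repeatedly drop the single point whose removal
    # most lowers the slope, refitting after each drop
    xs, ys = x.copy(), y.copy()
    for _ in range(k):
        base = ols_slope(xs, ys)
        effects = [ols_slope(np.delete(xs, i), np.delete(ys, i)) - base
                   for i in range(len(xs))]
        j = int(np.argmin(effects))
        xs, ys = np.delete(xs, j), np.delete(ys, j)
    return ols_slope(xs, ys)

# synthetic data (illustrative only)
rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

full = ols_slope(x, y)

# exact leave-one-out effect of each point on the slope
loo = np.array([ols_slope(np.delete(x, i), np.delete(y, i)) - full
                for i in range(n)])

# additive approximation: treat a subset's effect as the sum of individual effects
S = np.argsort(loo)[:2]          # two points whose removal most lowers the slope
additive = full + loo[S].sum()

# exact refit without that subset, for comparison
mask = np.ones(n, dtype=bool)
mask[S] = False
exact = ols_slope(x[mask], y[mask])

greedy = greedy_drop(x, y, 2)

print(f"full slope      {full:.4f}")
print(f"additive approx {additive:.4f}")
print(f"greedy refit    {greedy:.4f}")
print(f"exact refit     {exact:.4f}")
```

On well-behaved data the additive and greedy answers track the exact refit closely; the paper's point is that in realistic arrangements (e.g. masking, where one outlier conceals another's influence) both can diverge badly from the exact worst-case subset.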