A data analyst might worry about generalization if dropping a very small fraction of data points from a study could change its substantive conclusions. Finding the worst-case data subset to drop poses a combinatorial optimization problem. To overcome this intractability, recent work proposes additive approximations, which treat the contribution of a collection of data points as the sum of their individual contributions, and greedy approximations, which iteratively select the point with the highest impact to drop and re-run the data analysis without that point [Broderick et al., 2020, Kuschnig et al., 2021]. We identify that, even in a setting as simple as OLS linear regression, many of these approximations can break down in realistic data arrangements. Several of our examples reflect masking, where one outlier may conceal the effect of another. Based on the failures we identify, we provide recommendations for users and suggest directions for future improvements.
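The two approximations described above can be sketched concretely for a univariate OLS slope. The following is a minimal illustration, not the cited authors' implementation: the function names, the toy data, and the pair of "masking" outliers (whose individual leave-one-out effects roughly cancel, understating their joint effect) are all assumptions made for the example.

```python
# Hedged sketch of additive vs greedy approximations to worst-case
# data dropping for an OLS slope. Illustrative only; names and data
# are assumptions, not the implementation from the cited works.
import numpy as np

def ols_slope(x, y):
    # Fit y = a + b*x by least squares and return the slope b.
    A = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[1]

def additive_drop(x, y, k):
    # Additive approximation: score each point by the slope change when
    # that point alone is dropped, then drop the k highest-|score| points
    # together, treating their effects as if they simply sum.
    full = ols_slope(x, y)
    scores = np.array([ols_slope(np.delete(x, i), np.delete(y, i)) - full
                       for i in range(len(x))])
    drop = np.argsort(-np.abs(scores))[:k]
    keep = np.setdiff1d(np.arange(len(x)), drop)
    return sorted(drop.tolist()), ols_slope(x[keep], y[keep])

def greedy_drop(x, y, k):
    # Greedy approximation: repeatedly drop the single point whose
    # removal changes the current slope the most, refitting each time.
    keep = list(range(len(x)))
    dropped = []
    for _ in range(k):
        cur = ols_slope(x[keep], y[keep])
        best_j, best_eff = 0, -1.0
        for j in range(len(keep)):
            sub = keep[:j] + keep[j + 1:]
            eff = abs(ols_slope(x[sub], y[sub]) - cur)
            if eff > best_eff:
                best_j, best_eff = j, eff
        dropped.append(keep.pop(best_j))
    return sorted(dropped), ols_slope(x[keep], y[keep])

# Toy data: clean linear trend plus two high-leverage outliers that pull
# the slope in opposite directions, so each one's leave-one-out effect
# can mask the other's.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(scale=0.5, size=50)
x = np.append(x, [4.0, 4.0])
y = np.append(y, [20.0, -4.0])

print("additive:", additive_drop(x, y, 2))
print("greedy:  ", greedy_drop(x, y, 2))
```

Comparing the two subsets (and the refit slopes) against an exhaustive search over all pairs is exactly the kind of check under which the abstract reports these approximations can break down.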