Small influential subsets of data can dramatically change a model's conclusions: a few data points can overturn key findings. While recent work identifies these most influential sets, there is no formal way to tell when their maximal influence is excessive rather than expected under natural random sampling variation. We address this gap by developing a principled framework for most influential sets. Focusing on linear least-squares, we derive a convenient exact influence formula and identify the extreme value distributions of maximal influence: the heavy-tailed Fréchet for constant-size sets and heavy-tailed data, and the well-behaved Gumbel for growing sets or light tails. This allows us to conduct rigorous hypothesis tests for excessive influence. We demonstrate the framework through applications across economics, biology, and machine learning benchmarks, resolving contested findings and replacing ad-hoc heuristics with rigorous inference.
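The abstract does not reproduce the exact influence formula, so the following is only an illustrative sketch, assuming the standard closed-form identity for deleting a set of rows S from an ordinary least-squares fit: the coefficient change equals (X'X)^{-1} X_S' (I - H_SS)^{-1} e_S, where H_SS is the block of the hat matrix for S and e_S are the full-sample residuals on S. It may differ in form from the expression derived in the paper. The function names `set_influence` and `refit_influence` are hypothetical and chosen for this example.

```python
import numpy as np

def set_influence(X, y, S):
    """Change in OLS coefficients (full fit minus fit without rows S),
    computed in closed form without refitting. Hypothetical helper."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y            # full-sample OLS coefficients
    resid = y - X @ beta                # full-sample residuals
    X_S = X[S]
    H_SS = X_S @ XtX_inv @ X_S.T        # hat-matrix block for the set S
    # beta_full - beta_without_S = (X'X)^{-1} X_S' (I - H_SS)^{-1} e_S
    correction = np.linalg.solve(np.eye(len(S)) - H_SS, resid[S])
    return XtX_inv @ X_S.T @ correction

def refit_influence(X, y, S):
    """Same quantity obtained by brute force: drop rows S and refit."""
    keep = np.setdiff1d(np.arange(len(y)), S)
    beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
    beta_drop = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    return beta_full - beta_drop

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=50)
S = np.array([0, 3, 7])

closed_form = set_influence(X, y, S)
brute_force = refit_influence(X, y, S)
```

On this synthetic data the closed-form and refit results should agree up to floating-point error. Avoiding a refit matters when a search over many candidate sets is needed to locate the most influential one.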