We investigate practical algorithms to find or disprove the existence of small subsets of a dataset which, when removed, reverse the sign of a coefficient in an ordinary least squares regression involving that dataset. We empirically study the performance of well-established algorithmic techniques for this task -- mixed integer quadratically constrained optimization for general linear regression problems and exact greedy methods for special cases. We show that these methods largely outperform the state of the art and provide a useful robustness check for regression problems in a few dimensions. However, significant computational bottlenecks remain, especially for the important task of disproving the existence of such small sets of influential samples for regression problems of dimension $3$ or greater. We make some headway on this challenge via a spectral algorithm using ideas drawn from recent innovations in algorithmic robust statistics. We summarize the limitations of known techniques in several challenge datasets to encourage further algorithmic innovation.
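To make the task concrete, the following is a minimal sketch of a naive refit-based greedy heuristic for finding a subset of samples whose removal flips the sign of one OLS coefficient. It is an illustration only and is not the exact greedy method, the mixed integer formulation, or the spectral algorithm studied in the paper. The function names `ols_coef` and `greedy_sign_flip` and the parameter `max_remove` are hypothetical. This heuristic can only find a flipping subset; when it fails, that does not disprove the existence of one.

```python
import numpy as np


def ols_coef(X, y, j):
    """Return the j-th coefficient of the least squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[j]


def greedy_sign_flip(X, y, j=0, max_remove=10):
    """Greedily drop samples until the sign of coefficient j flips.

    Returns the list of removed row indices, or None if no flip was found
    within max_remove removals. A None result is NOT a certificate that no
    small flipping subset exists.
    """
    keep = list(range(len(y)))
    removed = []
    sign = np.sign(ols_coef(X, y, j))  # sign of the full-sample coefficient
    for _ in range(max_remove):
        best_i, best_val = None, np.inf
        # Refit without each remaining sample; pick the removal that pushes
        # the coefficient furthest toward the opposite sign.
        for i in keep:
            rest = [k for k in keep if k != i]
            val = sign * ols_coef(X[rest], y[rest], j)
            if val < best_val:
                best_i, best_val = i, val
        keep.remove(best_i)
        removed.append(best_i)
        if best_val < 0:  # coefficient now has the opposite sign
            return removed
    return None


# Toy data (made up for illustration): one high-leverage point drives
# a positive slope in a regression through the origin.
X_demo = np.array([[1.0], [1.0], [1.0], [1.0], [10.0]])
y_demo = np.array([-1.0, -1.0, -1.0, -1.0, 10.0])
flipping_set = greedy_sign_flip(X_demo, y_demo, j=0, max_remove=3)
```

Each greedy step refits the regression once per remaining sample, so the sketch is only practical for small datasets. Certifying that no small flipping subset exists requires a different kind of method, which is the harder direction highlighted in the abstract.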