Robust correlation analysis is among the most critical challenges in statistics. Herein, we develop an efficient algorithm for selecting the $k$-subset of $n$ points in the plane with the highest coefficient of determination $\left( R^2 \right)$. Drawing from combinatorial geometry, we propose a method called the \textit{quadratic sweep} that consists of two steps: (i) projectively lifting the data points into $\mathbb R^5$ and then (ii) iterating over each linearly separable $k$-subset. Its basis is that the optimal set of outliers is separable from its complement in $\mathbb R^2$ by a conic section, which corresponds to a separating hyperplane in $\mathbb R^5$ and can be found by a topological sweep in $\Theta \left( n^5 \log n \right)$ time. Although key proofs of quadratic separability are still in progress, we develop strong mathematical intuitions for our conjectures, then experimentally demonstrate our method's optimality over several million trials up to $n=30$ without error. Implementations in Julia and fully seeded, reproducible experiments are available at https://github.com/marc-harary/QuadraticSweep.
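The two ingredients named above, the $R^2$ objective over $k$-subsets and the lift into $\mathbb R^5$, can be illustrated with a minimal Python sketch. It is not the authors' Julia implementation and does not perform the topological sweep: `best_subset_brute_force` is an exhaustive reference baseline, and the lift $(x, y) \mapsto (x, y, x^2, xy, y^2)$ is an assumed form of the lifting, chosen because it turns any conic in $\mathbb R^2$ into an affine function on $\mathbb R^5$. All function names are hypothetical.

```python
from itertools import combinations


def r_squared(points):
    """Squared Pearson correlation (R^2) of a sequence of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: degenerate subsets are scored as uncorrelated
    return sxy * sxy / (sxx * syy)


def best_subset_brute_force(points, k):
    """Exhaustive baseline: try all C(n, k) subsets, keep the highest R^2."""
    return max(combinations(points, k), key=r_squared)


def lift(point):
    """Assumed lift of (x, y) into R^5: (x, y, x^2, x*y, y^2)."""
    x, y = point
    return (x, y, x * x, x * y, y * y)


def conic_value(coeffs, point):
    """Evaluate a*x^2 + b*x*y + c*y^2 + d*x + e*y + f at a point in R^2."""
    a, b, c, d, e, f = coeffs
    x, y = point
    return a * x * x + b * x * y + c * y * y + d * x + e * y + f


def hyperplane_value(coeffs, lifted):
    """Evaluate the same conic as an affine function of the lifted point."""
    a, b, c, d, e, f = coeffs
    u1, u2, u3, u4, u5 = lifted
    return d * u1 + e * u2 + a * u3 + b * u4 + c * u5 + f


# Four points on y = 2x + 1 plus two outliers.
points = [(0, 1), (1, 3), (2, 5), (3, 7), (1, 6), (3, 0)]
best = best_subset_brute_force(points, 4)

# The unit circle x^2 + y^2 - 1 = 0 as a conic coefficient vector.
circle = (1, 0, 1, 0, 0, -1)
```

Here `best` recovers the four collinear points, and `conic_value(circle, p)` agrees with `hyperplane_value(circle, lift(p))` for any point `p`, which is why a conic separating two point sets in the plane corresponds to a hyperplane separating their lifted images.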