We present a novel 2D convex hull peeling algorithm for outlier detection, which repeatedly removes the point on the hull that decreases the hull's area the most. To find $k$ outliers among $n$ points, one simply peels $k$ points. The algorithm is an efficient heuristic alternative to exact methods, which find the $k$ points whose joint removal results in the smallest convex hull. Our algorithm runs in $O(n \log n)$ time using $O(n)$ space for any choice of $k$. This is a significant speedup compared to the fastest exact algorithms: the algorithm of Eppstein et al. runs in $O(n^2 \log n + (n-k)^3)$ time using $O(n \log n + (n-k)^3)$ space, and that of Atanassov et al. runs in $O(n \log n + \binom{4k}{2k} (3k)^k n)$ time. Existing heuristic peeling approaches are not area-based. One approach, by Harsh et al., repeatedly removes the point furthest from the mean using various distance metrics and runs in $O(n \log n + kn)$ time. Other approaches greedily peel one convex layer at a time, which is efficient when using an $O(n \log n)$ time algorithm by Chazelle to compute the convex layers. However, in many cases this fails to recover outliers. For most values of $n$ and $k$, our approach is the fastest and first practical choice for finding outliers based on minimizing the area of the convex hull. Our algorithm also generalizes to other objectives such as perimeter.
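The greedy peeling rule described in the abstract can be illustrated with a deliberately naive sketch. This is not the $O(n \log n)$ algorithm the abstract refers to: it recomputes the convex hull from scratch for every candidate vertex, so each of the $k$ rounds costs roughly $O(h \cdot n \log n)$ for a hull with $h$ vertices. The function names (`convex_hull`, `hull_area`, `peel_outliers`) are hypothetical, and the sketch assumes all input points are distinct.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def cross(o: Point, a: Point, b: Point) -> float:
    """Cross product of vectors OA and OB; positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point]) -> List[Point]:
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        # Pop while the last two points and p do not make a strict left turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # Drop the last point of each chain; it is the first point of the other.
    return lower[:-1] + upper[:-1]


def hull_area(points: List[Point]) -> float:
    """Area of the convex hull of the points, via the shoelace formula."""
    h = convex_hull(points)
    n = len(h)
    twice_area = sum(
        h[i][0] * h[(i + 1) % n][1] - h[(i + 1) % n][0] * h[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2.0


def peel_outliers(points: List[Point], k: int) -> Tuple[List[Point], List[Point]]:
    """Naive greedy peeling: k times, remove the hull vertex whose removal
    leaves the smallest hull area (the largest area decrease).
    Assumes distinct points. Returns (removed, remaining)."""
    remaining = list(points)
    removed: List[Point] = []
    for _ in range(min(k, len(remaining))):
        hull = convex_hull(remaining)
        best = min(
            hull,
            key=lambda v: hull_area([p for p in remaining if p != v]),
        )
        remaining.remove(best)
        removed.append(best)
    return removed, remaining
```

For example, calling `peel_outliers` with `k = 1` on the four corners of a unit square, its center, and one distant point `(10, 10)` peels the distant point, since removing it shrinks the hull far more than removing any corner would.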