This paper develops a flexible distribution-free method for collective outlier detection and enumeration, designed for situations in which the presence of outliers can be detected powerfully even though their precise identification may be challenging due to the sparsity, weakness, or elusiveness of their signals. This method builds upon recent developments in conformal inference and integrates classical ideas from other areas, including multiple testing, locally most powerful and adaptive rank tests, and non-parametric large-sample asymptotics. The key innovation lies in developing a principled and effective approach for automatically choosing the most appropriate machine learning classifier and two-sample testing procedure for a given data set. The performance of our method is investigated through extensive empirical demonstrations, including an analysis of the LHCO high-energy particle collision data set.