This paper develops a flexible distribution-free method for collective outlier detection and enumeration, designed for situations in which the presence of outliers can be detected powerfully even though their precise identification may be challenging due to the sparsity, weakness, or elusiveness of their signals. This method builds upon recent developments in conformal inference and integrates classical ideas from other areas, including multiple testing, rank tests, and non-parametric large-sample asymptotics. The key innovation lies in developing a principled and effective approach for automatically choosing the most appropriate machine learning classifier and two-sample testing procedure for a given data set. The performance of our method is investigated through extensive empirical demonstrations, including an analysis of the LHCO high-energy particle collision data set.