Detecting anomalies in large sets of observations is crucial in various applications, such as epidemiological studies, gene expression studies, and systems monitoring. We consider settings where each unit of interest yields multiple independent observations from potentially distinct referentials. Scan statistics and related methods are commonly used in such settings, but they rely on stringent modeling assumptions for proper calibration. We instead propose a rank-based variant of the higher criticism statistic that only requires independent observations originating from ordered spaces. We establish the conditions under which the resulting methodology is able to detect the presence of anomalies. These conditions are stated in a general, non-parametric manner, and depend solely on the probabilities of anomalous observations exceeding nominal observations. The analysis requires a refined understanding of the distribution of the ranks in the presence of anomalies, and in particular of the rank-induced dependencies. Through the use of ranks, the methodology is robust against heavy-tailed distributions. Within the exponential family and a family of convolutional models, we analytically quantify the asymptotic performance of our methodology and that of the oracle, and show that the difference is small for many common models. Simulations confirm these results. We show the applicability of the methodology through an analysis of quality control data from a pharmaceutical manufacturing process.
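For readers unfamiliar with higher criticism, the following is a minimal sketch of the classical p-value form of the statistic (as commonly attributed to Donoho and Jin), not the rank-based variant proposed here, whose construction the abstract does not spell out. The function names `higher_criticism` and `simulate`, the truncation fraction `alpha0`, and the simulated p-values are all illustrative assumptions.

```python
import math
import random


def higher_criticism(p_values, alpha0=0.5):
    """Classical higher criticism statistic computed from p-values.

    Assumption: this is the standard p-value form, maximized over the
    smallest alpha0 fraction of p-values; it is NOT the rank-based
    variant described in the abstract.
    """
    n = len(p_values)
    if n == 0:
        raise ValueError("need at least one p-value")
    sorted_p = sorted(p_values)
    k_max = max(1, int(alpha0 * n))
    best = -math.inf
    for i in range(1, k_max + 1):
        p = sorted_p[i - 1]
        if p <= 0.0 or p >= 1.0:
            continue  # standardization is undefined at the boundary
        # Standardized gap between the empirical and the uniform CDF at p.
        z = math.sqrt(n) * (i / n - p) / math.sqrt(p * (1.0 - p))
        best = max(best, z)
    return best


def simulate(seed=0, n=1000, n_anomalous=50, small_p=1e-6):
    """Hypothetical toy comparison: uniform p-values vs. a sparse mixture."""
    rng = random.Random(seed)
    null_p = [rng.random() for _ in range(n)]
    alt_p = list(null_p)
    # Replace a small fraction of p-values by very small ones (anomalies).
    alt_p[:n_anomalous] = [small_p] * n_anomalous
    return higher_criticism(null_p), higher_criticism(alt_p)
```

Calling `simulate()` returns the statistic for the all-nominal sample and for the sample with a few anomalous entries; a markedly larger second value is what signals the presence of anomalies in this toy setting.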