Automatic detection of statistical outliers is easier when the source distribution of regular observations is known. Since the population distribution is often unknown in practice, one approach is to transform the data to Normality. However, the efficacy of transformation is hindered by the presence of outliers, which can have an outsized influence on the transformation parameters and lead to masking of outliers after transformation. Robust Box-Cox and Yeo-Johnson transformations have been proposed, but those transformations are only equipped to deal with skew. Here, we develop a novel robust method for transformation to Normality based on the highly flexible sinh-arcsinh (SHASH) family of distributions, which can accommodate skew, non-Gaussian tail weights, and combinations of both. A critical step is identifying an initial set of outliers, given their potential influence on the highly flexible SHASH transformation. To this end, we consider conventional robust z-scoring and a novel anomaly detection approach. Through extensive simulation studies and real data analyses representing a wide variety of distribution shapes, we find that SHASH transformation outperforms existing methods, exhibiting high sensitivity to outliers even at heavy contamination levels (20-30%). We illustrate the utility of SHASH transformation-based outlier detection in the context of noise reduction in functional neuroimaging data.
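To make the two ingredients named above concrete, the following is a minimal sketch, not the method developed in the paper. It assumes the standard sinh-arcsinh parameterization, in which Z = sinh(delta * arcsinh((X - mu) / sigma) - epsilon) is standard Normal, and the usual median/MAD form of robust z-scoring. The function names, the fixed parameter values, and the cutoff of 4 are illustrative assumptions; in practice the SHASH parameters would be estimated robustly from the data rather than supplied.

```python
import numpy as np

def robust_z(x):
    """Median/MAD z-scores; 1.4826 makes the MAD consistent with the SD under Normality."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = 1.4826 * np.median(np.abs(x - med))
    return (x - med) / mad

def shash_to_normal(x, mu=0.0, sigma=1.0, epsilon=0.0, delta=1.0):
    """Map SHASH(mu, sigma, epsilon, delta) values to standard Normal scores."""
    y = (np.asarray(x, dtype=float) - mu) / sigma
    return np.sinh(delta * np.arcsinh(y) - epsilon)

def normal_to_shash(z, mu=0.0, sigma=1.0, epsilon=0.0, delta=1.0):
    """Inverse map, used here only to simulate SHASH-distributed data."""
    z = np.asarray(z, dtype=float)
    return mu + sigma * np.sinh((np.arcsinh(z) + epsilon) / delta)

# Simulate right-skewed (epsilon > 0), heavy-tailed (delta < 1) data.
rng = np.random.default_rng(0)
z = rng.standard_normal(1000)
x = normal_to_shash(z, epsilon=0.5, delta=0.8)
x[:5] = 1e3  # inject five gross outliers

# Step 1 (assumed cutoff): initial outlier flags from robust z-scores.
init_flags = np.abs(robust_z(x)) > 4

# Step 2: transform with the known parameters (an idealization; they would
# normally be fitted to the non-flagged observations) and flag on the Normal scale.
z_back = shash_to_normal(x, epsilon=0.5, delta=0.8)
final_flags = np.abs(z_back) > 4
```

In this sketch `epsilon` controls skew and `delta` controls tail weight, which is the flexibility that Box-Cox and Yeo-Johnson transformations lack.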