Robust estimation of the covariance matrix and detection of outliers remain major challenges in statistical data analysis, particularly when the proportion of contaminated observations increases with the size of the dataset. Outliers can severely bias parameter estimates and induce a masking effect, whereby some outliers conceal the presence of others, further complicating their detection. Although many approaches have been proposed for covariance estimation and outlier detection, to our knowledge none has been implemented in an online setting. In this paper, we focus on online covariance matrix estimation and outlier detection. Specifically, we propose a new method that estimates the geometric median and the variance simultaneously and online, so that the Mahalanobis distance of each incoming data point can be computed before deciding whether the point should be considered an outlier. Mitigating the masking effect requires robust estimates of both location and dispersion: our approach uses the geometric median for the location and the median covariance matrix for the dispersion parameters. The proposed online methods for parameter estimation and outlier detection identify outliers in real time as data are observed sequentially. The performance of our methods is demonstrated on simulated datasets.
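To make the sequential workflow concrete, the following is a minimal sketch in plain Python and not the estimator proposed in the paper. It scores each incoming point with a squared Mahalanobis distance computed from the current estimates, and only then updates those estimates. The location is tracked with an averaged stochastic gradient recursion for the geometric median. The dispersion estimate is a deliberate simplification: instead of the median covariance matrix, it is a running mean of outer products that skips flagged points, which assumes the first `burn_in` observations are mostly clean. The names `OnlineOutlierDetector`, `solve`, `demo`, `burn_in`, `power` and the synthetic stream are all hypothetical and chosen for illustration only.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


class OnlineOutlierDetector:
    """Hypothetical helper: streaming median, scatter, and Mahalanobis gate."""

    def __init__(self, dim, threshold, burn_in=100, power=0.66):
        self.dim, self.threshold = dim, threshold
        self.burn_in, self.power = burn_in, power
        self.n = 0
        self.sgd = None     # raw stochastic-gradient iterate
        self.median = None  # averaged iterate = geometric median estimate
        # Identity acts as one pseudo-observation so the scatter stays invertible.
        self.scatter = [[float(i == j) for j in range(dim)] for i in range(dim)]
        self.n_scatter = 1

    def update(self, x):
        """Score x against the current estimates, then update them."""
        self.n += 1
        if self.n == 1:  # initialise the median at the first observation
            self.sgd, self.median = list(x), list(x)
            return 0.0, False
        diff = [xi - mi for xi, mi in zip(x, self.median)]
        d2 = sum(d * s for d, s in zip(diff, solve(self.scatter, diff)))
        is_outlier = self.n > self.burn_in and d2 > self.threshold

        # Geometric median: unit-length step towards x, then average the iterates.
        g = [xi - si for xi, si in zip(x, self.sgd)]
        norm = math.sqrt(sum(v * v for v in g))
        if norm > 0.0:
            gamma = 1.0 / self.n ** self.power
            self.sgd = [si + gamma * gi / norm for si, gi in zip(self.sgd, g)]
        self.median = [mi + (si - mi) / self.n
                       for mi, si in zip(self.median, self.sgd)]

        # Scatter (simplified assumption): running mean of outer products,
        # skipping flagged points so they cannot inflate the estimate.
        if not is_outlier:
            self.n_scatter += 1
            w = 1.0 / self.n_scatter
            for i in range(self.dim):
                for j in range(self.dim):
                    self.scatter[i][j] += w * (diff[i] * diff[j] - self.scatter[i][j])
        return d2, is_outlier


def demo(n=2000, seed=0):
    """Synthetic 2-D stream: clean burn-in, then every 10th point is an outlier."""
    rng = random.Random(seed)
    # 0.99 quantile of a chi-square law with 2 degrees of freedom,
    # appropriate only if the inliers are Gaussian in dimension 2.
    det = OnlineOutlierDetector(dim=2, threshold=-2.0 * math.log(0.01))
    hits = outliers = false_alarms = inliers = 0
    for t in range(n):
        planted = t >= 100 and t % 10 == 9
        centre = (30.0, 30.0) if planted else (3.0, -2.0)
        x = [rng.gauss(c, 1.0) for c in centre]
        _, flagged = det.update(x)
        if planted:
            outliers += 1
            hits += flagged
        elif t >= 100:
            inliers += 1
            false_alarms += flagged
    return hits / outliers, false_alarms / inliers, det.median
```

Calling `demo()` returns the fraction of planted outliers that were flagged, the false-alarm rate among inliers after the burn-in, and the final median estimate. Because every update of the median moves by a step of bounded length regardless of how far the point lies, a distant outlier cannot drag the location estimate arbitrarily. The gated scatter, by contrast, can be corrupted if outliers arrive during the burn-in, which is precisely the masking problem that a robust dispersion estimate is meant to avoid.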