Large datasets are often affected by cell-wise outliers in the form of missing or erroneous data. However, discarding every sample that contains an outlier may leave a dataset too small to estimate the covariance matrix accurately. Moreover, most robust procedures designed to address this problem are ineffective on high-dimensional data because they rely crucially on the invertibility of the covariance operator. In this paper, we propose an unbiased estimator of the covariance in the presence of missing values that requires no imputation step and still achieves minimax statistical accuracy in the operator norm. We also advocate using it in combination with cell-wise outlier detection methods to tackle cell-wise contamination in a high-dimensional, low-rank setting, where state-of-the-art methods may suffer from numerical instability and long computation times. To complement our theoretical findings, we conducted an experimental study that demonstrates the superiority of our approach over the state of the art in both low- and high-dimensional settings.
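To make the idea of imputation-free estimation concrete, the sketch below shows one classical debiasing correction; it is not necessarily the estimator proposed in this paper. It assumes mean-zero data whose entries are observed independently with a common probability `delta` (missing completely at random). Under these assumptions the second-moment matrix of the zero-filled data has off-diagonal entries shrunk by `delta**2` and diagonal entries shrunk by `delta`, so rescaling the two parts separately removes the bias when `delta` is known. The names `debiased_covariance`, `mask`, and `delta` are hypothetical and chosen for illustration only.

```python
import numpy as np


def debiased_covariance(Y, mask, delta=None):
    """Covariance estimate from partially observed, mean-zero data.

    Y     : (n, p) array; entries where mask is False are ignored (may be NaN).
    mask  : (n, p) boolean array, True where the entry is observed.
    delta : assumed common observation probability; if None, it is
            estimated by the overall fraction of observed entries.
    """
    mask = np.asarray(mask, dtype=bool)
    # Zero-fill the unobserved cells; no imputation model is involved.
    Y0 = np.where(mask, np.asarray(Y, dtype=float), 0.0)
    n = Y0.shape[0]
    if delta is None:
        delta = mask.mean()
    # Second-moment matrix of the zero-filled data (mean assumed to be zero).
    S = Y0.T @ Y0 / n
    D = np.diag(np.diag(S))
    # Off-diagonal entries are rescaled by 1/delta^2, diagonal ones by 1/delta.
    return S / delta**2 + (1.0 / delta - 1.0 / delta**2) * D


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p, delta = 5000, 4, 0.7
    A = rng.normal(size=(p, p))
    Sigma = A @ A.T  # synthetic ground-truth covariance
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    mask = rng.random((n, p)) < delta
    X_obs = np.where(mask, X, np.nan)
    Sigma_hat = debiased_covariance(X_obs, mask, delta=delta)
    # Operator-norm (spectral) error of the estimate.
    print(np.linalg.norm(Sigma_hat - Sigma, ord=2))
```

When `delta` is estimated from the mask rather than supplied, the estimate is only approximately unbiased. The result is also not guaranteed to be positive semidefinite, which matters if it is later inverted or factorized.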