Large datasets are often affected by cell-wise outliers in the form of missing or erroneous data. However, discarding every sample that contains an outlier may leave a dataset too small to estimate the covariance matrix accurately. Moreover, the robust procedures designed to address this problem require the covariance operator to be invertible and are therefore not effective on high-dimensional data. In this paper, we propose an unbiased estimator of the covariance in the presence of missing values that requires no imputation step and still achieves near-minimax statistical accuracy in the operator norm. We also advocate its use in combination with cell-wise outlier detection methods to tackle cell-wise contamination in a high-dimensional, low-rank setting, where state-of-the-art methods may suffer from numerical instability and long computation times. To complement our theoretical findings, we conduct an experimental study that demonstrates the superiority of our approach over the state of the art in both low- and high-dimensional settings.
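To make the idea of an imputation-free, unbiased covariance estimate concrete, the following is a minimal sketch of one standard debiasing construction; it is not necessarily the estimator proposed in the paper. It assumes, beyond what the abstract states, that the data are centered, that missing cells are marked with NaN, and that each cell is observed independently with a common probability `delta`. The function name `mcar_covariance` and its parameters are hypothetical. Under these assumptions, the zero-filled empirical covariance has off-diagonal entries shrunk by `delta**2` and diagonal entries shrunk by `delta`, so rescaling each part separately removes the bias.

```python
import numpy as np

def mcar_covariance(X, delta=None):
    """Debiased covariance from data whose missing cells are NaN.

    Illustrative sketch only. Assumes centered data and cells observed
    independently with the same probability `delta` (an assumption, not
    something stated in the abstract).
    """
    X = np.asarray(X, dtype=float)
    mask = ~np.isnan(X)                # True where a cell is observed
    if delta is None:
        # Plug-in estimate of the observation probability; with an
        # estimated delta the result is only approximately unbiased.
        delta = mask.mean()
    Y = np.where(mask, X, 0.0)         # zero-fill, no imputation model
    n = Y.shape[0]
    S = Y.T @ Y / n                    # covariance of the zero-filled data
    D = np.diag(np.diag(S))
    # Off-diagonal entries are rescaled by 1/delta^2, diagonal by 1/delta.
    return (1.0 / delta - 1.0 / delta**2) * D + S / delta**2

# Usage: a low-rank covariance with roughly 20% of cells missing.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 2))        # true covariance is A @ A.T (rank 2)
X_demo = rng.standard_normal((50_000, 2)) @ A.T
X_demo[rng.random(X_demo.shape) >= 0.8] = np.nan
Sigma_hat = mcar_covariance(X_demo, delta=0.8)
```

Because the correction only rescales entries, it never requires inverting a covariance matrix, which is why a construction of this kind remains usable when the dimension is large or the covariance is low-rank.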