Large datasets are often affected by cell-wise outliers in the form of missing or erroneous data. However, discarding every sample that contains an outlier may leave a dataset too small to estimate the covariance matrix accurately. Moreover, most robust procedures designed to address this problem are ineffective on high-dimensional data because they rely crucially on the invertibility of the covariance operator. In this paper, we propose an unbiased estimator of the covariance in the presence of missing values that requires no imputation step and still achieves minimax statistical accuracy in the operator norm. We also advocate its use in combination with cell-wise outlier detection methods to tackle cell-wise contamination in a high-dimensional, low-rank setting, where state-of-the-art methods may suffer from numerical instability and long computation times. To complement our theoretical findings, we conducted an experimental study that demonstrates the superiority of our approach over the state of the art in both low- and high-dimensional settings.
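To make the idea of imputation-free estimation concrete, the following is a minimal sketch of the classical inverse-probability correction for a covariance estimate computed from zero-filled data. It is an illustration under stated assumptions, not necessarily the exact estimator proposed in the paper: the data are assumed to have zero mean, each cell is assumed to be observed independently with a known probability `delta` (missing completely at random), and the names `debiased_covariance` and `delta` are hypothetical. Under these assumptions the second-moment matrix of the zero-filled data has expectation `delta * sigma_jj` on the diagonal and `delta**2 * sigma_jk` off the diagonal, so rescaling each entry accordingly removes the bias.

```python
import math


def debiased_covariance(rows, delta):
    """Inverse-probability-corrected covariance from zero-mean data with missing cells.

    rows  : list of equal-length lists; a missing cell is None or NaN
    delta : assumed known probability that a cell is observed, 0 < delta <= 1
    """
    if not 0.0 < delta <= 1.0:
        raise ValueError("delta must lie in (0, 1]")
    n, p = len(rows), len(rows[0])

    # Zero-fill the missing cells, i.e. y_ij = mask_ij * x_ij (no imputation).
    filled = [
        [0.0 if (v is None or math.isnan(v)) else float(v) for v in row]
        for row in rows
    ]

    # Empirical second-moment matrix of the zero-filled data.
    s = [
        [sum(y[j] * y[k] for y in filled) / n for k in range(p)]
        for j in range(p)
    ]

    # Undo the bias: diagonal entries shrink by delta, off-diagonal by delta^2.
    return [
        [s[j][k] / delta if j == k else s[j][k] / delta**2 for k in range(p)]
        for j in range(p)
    ]


# Usage: four samples in dimension two, two cells missing.
data = [[1.0, 2.0], [None, -2.0], [-1.0, float("nan")], [3.0, 1.0]]
sigma_hat = debiased_covariance(data, delta=0.75)
```

The correction is equivalent to the matrix form $(\delta^{-1} - \delta^{-2})\,\mathrm{diag}(S) + \delta^{-2} S$, where $S$ is the second-moment matrix of the zero-filled data. The result is unbiased under the assumptions above but is not guaranteed to be positive semidefinite in finite samples. In practice `delta` would have to be estimated, for instance from the observed fraction of non-missing cells.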