The sample covariance matrix is a cornerstone of multivariate statistics, but it is highly sensitive to outliers. These can be casewise outliers, such as cases belonging to a different population, or cellwise outliers, which are deviating cells (entries) of the data matrix. Some recently developed robust covariance estimators can handle both types of outliers, but their computation is feasible only in at most 20 dimensions. To remedy this we propose the cellRCov method, a robust covariance estimator that simultaneously handles casewise outliers, cellwise outliers, and missing data. It relies on a decomposition of the covariance on the principal and orthogonal subspaces, leveraging recent work on robust PCA. It also employs a ridge-type regularization to stabilize the estimated covariance matrix. We establish theoretical properties of cellRCov, including its casewise and cellwise influence functions as well as its consistency and asymptotic normality. A simulation study demonstrates the superior performance of cellRCov in scenarios with contaminated and missing data. Its practical utility is illustrated in a real-world application to anomaly detection. We also construct and illustrate the cellRCCA method for robust and regularized canonical correlation analysis.
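As a rough illustration of two points made above, the following sketch shows how a single contaminated cell can distort the classical sample covariance, and how a generic ridge-type shrinkage makes a singular covariance estimate positive definite. It is not the cellRCov estimator: the function name `ridge_regularize`, the scaled-identity target, and the tuning value `rho` are assumptions chosen for illustration, and the regularization actually used by cellRCov may take a different form.

```python
import numpy as np

def ridge_regularize(cov, rho=0.1):
    """Generic ridge-type shrinkage toward a scaled identity (illustrative only)."""
    p = cov.shape[0]
    target = (np.trace(cov) / p) * np.eye(p)  # identity scaled to the average variance
    return (1.0 - rho) * cov + rho * target

rng = np.random.default_rng(0)
n, p = 15, 30  # fewer cases than variables, so the sample covariance is singular
X = rng.standard_normal((n, p))

# Classical sample covariance and its regularized version.
S = np.cov(X, rowvar=False)
S_reg = ridge_regularize(S, rho=0.1)

rank_S = np.linalg.matrix_rank(S)               # at most n - 1, hence below p
min_eig_reg = np.linalg.eigvalsh(S_reg).min()   # strictly positive after shrinkage

# A single cellwise outlier: one entry of the data matrix is replaced.
X_bad = X.copy()
X_bad[0, 0] = 1000.0
S_bad = np.cov(X_bad, rowvar=False)

print("rank of S:", rank_S, "out of", p)
print("smallest eigenvalue after shrinkage:", min_eig_reg)
print("variance of variable 0, clean vs. contaminated:", S[0, 0], S_bad[0, 0])
```

In this sketch, shrinking toward a scaled identity lifts every eigenvalue away from zero, which keeps the estimate invertible when there are fewer cases than variables. The last line shows that one bad cell is enough to inflate the corresponding classical variance by orders of magnitude, which is the sensitivity that robust estimators are designed to avoid.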