This paper introduces several enhancements to the minimum covariance determinant (MCD) method of outlier detection and robust estimation of means and covariances. We use the principal component transform to reduce dimension, which in turn improves the analysis. Our best subset selection algorithm combines statistical depth with concentration steps. To choose the subset size and the number of principal components, we introduce a bootstrap procedure that estimates the instability of the best subset algorithm; the parameter combination with minimal instability is the one we use for outlier detection and robust estimation. Benchmarks against prominent MCD variants show that our approach offers superior statistical performance and computational speed in high dimensions. Applications to a fruit spectra data set and a cancer genomics data set illustrate these claims.
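To make the pipeline described above concrete, the following is a minimal sketch of principal component reduction followed by standard MCD concentration steps. It is not the authors' implementation. The function names `pca_scores` and `concentration_steps`, the median centering, and the initial subset (the `h` points nearest the coordinate-wise median, a crude stand-in for the depth-based start) are all assumptions made for illustration, and the bootstrap choice of subset size and number of components is omitted.

```python
import numpy as np

def pca_scores(X, q):
    """Project data onto its first q principal directions.

    Centering at the coordinate-wise median is an assumption of this
    sketch, not something stated in the abstract.
    """
    Xc = X - np.median(X, axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:q].T

def concentration_steps(Z, h, init_idx, max_iter=100):
    """Iterate concentration steps until the h-subset stops changing.

    Each step fits a mean and covariance to the current subset, then
    keeps the h points with the smallest Mahalanobis distances.
    """
    idx = np.sort(np.asarray(init_idx))
    for _ in range(max_iter):
        mu = Z[idx].mean(axis=0)
        cov = np.atleast_2d(np.cov(Z[idx], rowvar=False))
        diff = Z - mu
        # Squared Mahalanobis distance of every point to the subset fit.
        d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.pinv(cov), diff)
        new_idx = np.sort(np.argsort(d2)[:h])
        if np.array_equal(new_idx, idx):
            break
        idx = new_idx
    # Refit on the final subset so the estimates match the returned indices.
    mu = Z[idx].mean(axis=0)
    cov = np.atleast_2d(np.cov(Z[idx], rowvar=False))
    return idx, mu, cov

# Synthetic example: 180 clean points and 20 shifted outliers (rows 180-199).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[180:] += 10.0

q, h = 3, 150  # hypothetical values; the paper selects these by bootstrap
Z = pca_scores(X, q)

# Stand-in for a depth-based start: the h points closest to the median.
start = np.argsort(np.linalg.norm(Z - np.median(Z, axis=0), axis=1))[:h]
idx, mu, cov = concentration_steps(Z, h, start)
```

In this sketch `idx` holds the indices of the retained subset, and `mu` and `cov` are the corresponding estimates in the reduced space. Points with large Mahalanobis distance under these estimates would be the candidates for outliers.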