Covariance matrix estimation is an important problem in multivariate data analysis, from both theoretical and applied points of view. Many simple and popular covariance matrix estimators are known to be severely affected by model misspecification and the presence of outliers in the data; on the other hand, robust estimators with reasonably high efficiency are often computationally challenging for modern large and complex datasets. In this work, we propose a new, simple, robust and highly efficient method for estimating the location vector and the scatter matrix of elliptically symmetric distributions. The proposed estimation procedure is designed in the spirit of the minimum density power divergence (DPD) estimation approach, with appropriate modifications that make our proposal (sequential minimum DPD estimation) computationally very economical and scalable to large as well as higher-dimensional datasets. Consistency and asymptotic normality of the proposed sequential estimators of the multivariate location and scatter are established, along with asymptotic positive definiteness of the estimated scatter matrix. Robustness of our estimators is studied by means of influence functions. All theoretical results are further illustrated under multivariate normality. A large-scale simulation study is presented to assess the finite-sample performance and scalability of our method in comparison to the usual maximum likelihood estimator (MLE), the ordinary minimum DPD estimator (MDPDE) and other popular non-parametric methods. The applicability of our method is further illustrated with a real dataset on credit card transactions.
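For orientation, the ordinary MDPDE that the abstract uses as a comparator can be sketched under multivariate normality. The sketch below is not the sequential procedure proposed in this work, whose details the abstract does not give. It is a plain fixed-point iteration on the standard minimum DPD estimating equations for N_p(mu, Sigma), in which each observation is down-weighted by exp(-alpha * d^2 / 2), with d the Mahalanobis distance. The function name `mdpde_normal`, the median/MAD starting values, the stopping rule and the default `alpha` are all illustrative assumptions.

```python
import numpy as np


def mdpde_normal(X, alpha=0.2, n_iter=200, tol=1e-8):
    """Illustrative fixed-point sketch of the ordinary minimum DPD estimator
    of (mu, Sigma) under a multivariate normal model.

    Assumptions (not from the source): median/MAD start, simple stopping rule.
    With alpha = 0 the weights are all 1 and the iteration returns the MLE.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape

    # Robust starting values (an assumption of this sketch).
    mu = np.median(X, axis=0)
    mad = 1.4826 * np.median(np.abs(X - mu), axis=0)
    sigma = np.diag(mad ** 2)

    # Correction term from the normal-model DPD estimating equation for Sigma.
    c = alpha * (1.0 + alpha) ** (-(p / 2.0 + 1.0))

    for _ in range(n_iter):
        diff = X - mu
        # Squared Mahalanobis distances under the current (mu, sigma).
        d2 = np.einsum("ij,ij->i", diff, np.linalg.solve(sigma, diff.T).T)
        w = np.exp(-0.5 * alpha * d2)  # outlying points get weights near 0

        mu_new = w @ X / w.sum()
        denom = w.sum() - n * c
        if denom <= 0:
            raise ValueError("non-positive normaliser; try a smaller alpha")
        diff_new = X - mu_new
        sigma_new = (diff_new * w[:, None]).T @ diff_new / denom

        change = max(np.abs(mu_new - mu).max(), np.abs(sigma_new - sigma).max())
        mu, sigma = mu_new, sigma_new
        if change < tol:
            break

    return mu, sigma
```

Calling `mdpde_normal(X, alpha=0.5)` on an `n x p` data array returns the estimated location vector and scatter matrix. Larger values of `alpha` trade efficiency at the model for more aggressive down-weighting of outliers, and `alpha=0` recovers the sample mean and the (1/n) sample covariance.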