This paper is devoted to the statistical and numerical properties of the geometric median and to its application to robust mean estimation via the median-of-means principle. Our main theoretical results include (a) an upper bound on the distance between the mean and the median for general absolutely continuous distributions in R^d, together with examples of specific classes of distributions for which this bound does not depend on the ambient dimension d; (b) exponential deviation inequalities for the distance between the sample and population versions of the geometric median, which again depend only on trace-type quantities and not on the ambient dimension. As a corollary, we deduce improved bounds for the (geometric) median-of-means estimator that hold for large classes of heavy-tailed distributions. Finally, we address the error of numerical approximation, which is an important practical aspect of any statistical estimation procedure. We demonstrate that the objective function minimized by the geometric median satisfies a "local quadratic growth" condition, which allows one to translate suboptimality bounds for the objective function into corresponding bounds for the numerical approximation of the median itself, and we propose a simple stopping rule, applicable to any optimization method, that yields explicit error guarantees. We conclude with numerical experiments, including an application to estimating the mean log-returns for S&P 500 data.
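As a rough illustration of the estimator discussed above, the following Python sketch computes a geometric median of block means. It is a minimal sketch under stated assumptions, not the procedure analyzed in the paper: the function names `geometric_median` and `median_of_means` are hypothetical, the solver is the classical Weiszfeld iteration (chosen here only for brevity), and the step-size stopping test is a naive placeholder rather than the stopping rule with explicit error guarantees proposed in the paper.

```python
import numpy as np


def geometric_median(points, tol=1e-8, max_iter=1000):
    """Approximate the minimizer of sum_i ||x - p_i|| by Weiszfeld iteration.

    Assumption: the loop stops when successive iterates differ by at most
    `tol`; this is a naive heuristic, not the paper's stopping rule.
    """
    x = points.mean(axis=0)  # start from the sample mean
    for _ in range(max_iter):
        dist = np.linalg.norm(points - x, axis=1)
        dist = np.maximum(dist, 1e-12)  # guard against division by zero
        w = 1.0 / dist
        x_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(x_new - x) <= tol:
            return x_new
        x = x_new
    return x


def median_of_means(sample, n_blocks, seed=None):
    """Split the sample into disjoint blocks, average each block,
    and return the geometric median of the block means."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(sample))
    blocks = np.array_split(perm, n_blocks)
    block_means = np.stack([sample[idx].mean(axis=0) for idx in blocks])
    return geometric_median(block_means)
```

For example, `median_of_means(X, n_blocks=10, seed=0)` applied to an `(n, d)` array `X` returns a length-`d` estimate of the mean. The number of blocks controls the usual trade-off: more blocks tolerate more outliers, while fewer blocks give more accurate block means.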