This paper is devoted to the statistical and numerical properties of the geometric median and to its application to robust mean estimation via the median of means principle. Our main theoretical results include (a) an upper bound for the distance between the mean and the median for general absolutely continuous distributions in $\mathbb{R}^d$, together with examples of specific classes of distributions for which this bound does not depend on the ambient dimension $d$; and (b) exponential deviation inequalities for the distance between the sample and the population versions of the geometric median, which again depend only on trace-type quantities and not on the ambient dimension. As a corollary, we deduce improved bounds for the (geometric) median of means estimator that hold for large classes of heavy-tailed distributions. Finally, we address the error of numerical approximation, which is an important practical aspect of any statistical estimation procedure. We demonstrate that the objective function minimized by the geometric median satisfies a "local quadratic growth" condition, which allows one to translate suboptimality bounds for the objective function into corresponding bounds for the numerical approximation of the median itself. As a corollary, we propose a simple stopping rule (applicable to any optimization method) that yields explicit error guarantees. We conclude with numerical experiments, including an application to estimating the mean log-returns of S&P 500 data.
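For readers who want a concrete picture of the estimator discussed above, the following is a minimal sketch of a geometric median of means computation, not the authors' implementation. It assumes NumPy and uses Weiszfeld's fixed-point iteration as one possible solver; the function names `geometric_median` and `median_of_means`, the parameters `tol`, `max_iter`, `n_blocks`, and `seed`, and the step-size stopping criterion are all illustrative assumptions. In particular, the stopping criterion here is a generic one and is not the stopping rule proposed in the paper.

```python
import numpy as np


def geometric_median(points, tol=1e-8, max_iter=1000):
    """Approximate argmin_z sum_i ||x_i - z|| via Weiszfeld's iteration."""
    points = np.asarray(points, dtype=float)
    z = points.mean(axis=0)  # start from the sample mean
    for _ in range(max_iter):
        dist = np.linalg.norm(points - z, axis=1)
        # Guard against division by zero if the iterate hits a data point.
        weights = 1.0 / np.maximum(dist, 1e-12)
        z_new = weights @ points / weights.sum()
        # Generic step-size criterion (an assumption, NOT the paper's rule).
        if np.linalg.norm(z_new - z) <= tol:
            return z_new
        z = z_new
    return z


def median_of_means(sample, n_blocks, seed=0):
    """Geometric median of the means of n_blocks disjoint random blocks."""
    sample = np.asarray(sample, dtype=float)
    rng = np.random.default_rng(seed)
    # Shuffle indices, then split them into roughly equal blocks.
    blocks = np.array_split(rng.permutation(len(sample)), n_blocks)
    block_means = np.array([sample[b].mean(axis=0) for b in blocks])
    return geometric_median(block_means)
```

As a usage example, calling `median_of_means(data, n_blocks=20)` on an array `data` of shape `(n, d)` returns a vector of length `d`. The number of blocks trades robustness against efficiency: more blocks tolerate more outliers, while each block mean becomes noisier.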