It is well known that trimmed sample means are robust against heavy tails and data contamination. This paper analyzes the performance of trimmed means and related methods in two novel contexts. The first is the estimation of expectations of functions in a given family, with uniform error bounds; this is closely related to the problem of estimating the mean of a random vector under a general norm. The second is regression with quadratic loss. In both cases, trimmed-mean-based estimators are the first to obtain optimal dependence on the (adversarial) contamination level. Moreover, they match or improve upon the state of the art in their dependence on heavy tails. Experiments with synthetic data show that a natural "trimmed mean linear regression" method often performs better than both ordinary least squares and alternative methods based on median-of-means.
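For readers unfamiliar with the basic building block, the following is a minimal sketch of a one-dimensional symmetric trimmed mean: sort the sample, discard a fixed fraction of the smallest and largest values, and average what remains. The function name `trimmed_mean` and the parameter `trim_fraction` are illustrative assumptions, not notation from the paper, and the estimators analyzed in the paper (uniform over a function family, or for regression) may be defined differently.

```python
import numpy as np


def trimmed_mean(x, trim_fraction=0.1):
    """Symmetric trimmed mean (illustrative sketch, not the paper's estimator).

    Drops the k smallest and k largest observations, where
    k = floor(trim_fraction * n), and averages the remaining values.
    """
    if not 0.0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must lie in [0, 0.5)")
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    if n == 0:
        raise ValueError("x must contain at least one observation")
    k = int(np.floor(trim_fraction * n))  # number of points dropped from each end
    return float(x[k:n - k].mean())


# A single gross outlier (100) dominates the plain mean but is trimmed away.
sample = [1.0, 2.0, 3.0, 4.0, 100.0]
plain = float(np.mean(sample))
robust = trimmed_mean(sample, trim_fraction=0.2)
```

With `trim_fraction=0.2` and five observations, one point is removed from each end, so the outlier no longer influences the estimate. In practice the trimming fraction would be chosen in relation to the assumed contamination level and the desired confidence.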