It is well known that trimmed sample means are robust against heavy tails and data contamination. This paper analyzes the performance of trimmed means and related methods in two novel contexts. The first is the estimation of expectations of functions in a given family, with uniform error bounds; this is closely related to the problem of estimating the mean of a random vector under a general norm. The second is regression with quadratic loss. In both cases, trimmed-mean-based estimators are the first to obtain optimal dependence on the (adversarial) contamination level. Moreover, they match or improve upon the state of the art with respect to heavy tails. Experiments with synthetic data show that a natural "trimmed mean linear regression" method often performs better than both ordinary least squares and alternative methods based on median-of-means.
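For readers unfamiliar with the basic building block, the following is a minimal sketch of the classical univariate trimmed mean, not the estimators analyzed in the paper, which are more involved. The function name `trimmed_mean`, the `trim_fraction` parameter, and the 5% contamination scenario are all assumptions chosen for illustration.

```python
import random
import statistics


def trimmed_mean(values, trim_fraction=0.1):
    """Average the sample after dropping the smallest and largest
    `trim_fraction` share of the points (symmetric trimming).

    Illustrative helper only; the name and signature are assumptions.
    """
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must lie in [0, 0.5)")
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    k = int(trim_fraction * len(ordered))  # points dropped from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


random.seed(0)
clean = [random.gauss(1.0, 1.0) for _ in range(1000)]

# Hypothetical contamination: 5% of the points are replaced by a huge value.
contaminated = clean[:950] + [1e6] * 50

plain = statistics.fmean(contaminated)                  # dragged far from 1.0
robust = trimmed_mean(contaminated, trim_fraction=0.1)  # stays near 1.0
```

In this toy setting the trimming level (10% per side) exceeds the contamination level (5%), so every corrupted point is discarded before averaging. The trimmed mean keeps a small bias because the two tails of the clean data are no longer cut symmetrically.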