Many modern datasets are collected automatically and are thus easily contaminated by outliers. This has led to renewed interest in robust estimation, including new notions of robustness such as robustness to adversarial contamination of the data. However, most robust estimation methods are designed for a specific model. Notably, many methods have recently been proposed to obtain robust estimators in linear models (or generalized linear models), and a few have been developed for very specific settings, for example beta regression or sample selection models. In this paper we develop a new approach for robust estimation in arbitrary regression models, based on Maximum Mean Discrepancy minimization. We build two estimators, both of which are proven to be robust to Huber-type contamination. We obtain a non-asymptotic error bound for one of them and show that it is also robust to adversarial contamination, but this estimator is computationally more expensive to use in practice than the other one. As a by-product of our theoretical analysis of the proposed estimators, we derive new results on kernel conditional mean embedding of distributions which are of independent interest.