The Wasserstein distance, rooted in optimal transport (OT) theory, is a popular discrepancy measure between probability distributions with various applications to statistics and machine learning. Despite their rich structure and demonstrated utility, Wasserstein distances are sensitive to outliers in the considered distributions, which hinders their applicability in practice. We propose a new outlier-robust Wasserstein distance $\mathsf{W}_p^\varepsilon$ that allows $\varepsilon$ outlier mass to be removed from each contaminated distribution. Under standard moment assumptions, $\mathsf{W}_p^\varepsilon$ is shown to achieve strong robust estimation guarantees under the Huber $\varepsilon$-contamination model. Our formulation of this robust distance amounts to a highly regular optimization problem that lends itself better to analysis than previously considered frameworks. Leveraging this, we conduct a thorough theoretical study of $\mathsf{W}_p^\varepsilon$, encompassing robustness guarantees, characterization of optimal perturbations, regularity, duality, and statistical estimation. In particular, by decoupling the optimization variables, we arrive at a simple dual form for $\mathsf{W}_p^\varepsilon$ that can be implemented via an elementary modification to standard, duality-based OT solvers. We illustrate the virtues of our framework via applications to generative modeling with contaminated datasets.
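To make the idea of removing $\varepsilon$ mass from each distribution concrete, the following is a minimal sketch for discrete measures, not the duality-based solver modification described in the abstract. It assumes a primal reading of the definition: choose sub-measures $\mu' \le \mu$ and $\nu' \le \nu$, each of total mass $1-\varepsilon$, and minimize the transport cost between them, which becomes a small linear program. The function name `robust_wasserstein`, the renormalization by $1-\varepsilon$, and the use of SciPy's `linprog` are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import linprog


def robust_wasserstein(x, y, a, b, eps, p=1):
    """Illustrative primal LP for an outlier-robust Wasserstein distance.

    x, y : support points of the two discrete measures
    a, b : their probability weights (each summing to 1)
    eps  : mass allowed to be removed from each measure
    Assumption: the removed-mass problem is normalized by (1 - eps).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    x = np.asarray(x, dtype=float).reshape(n, -1)
    y = np.asarray(y, dtype=float).reshape(m, -1)

    # Pairwise cost |x_i - y_j|^p; the plan P is flattened row-major.
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2) ** p

    # Row sums of P are at most a, column sums are at most b:
    # mass may be dropped from either side but never added.
    row_sums = np.kron(np.eye(n), np.ones((1, m)))
    col_sums = np.kron(np.ones((1, n)), np.eye(m))
    A_ub = np.vstack([row_sums, col_sums])
    b_ub = np.concatenate([a, b])

    # Exactly 1 - eps mass is transported in total.
    A_eq = np.ones((1, n * m))
    b_eq = [1.0 - eps]

    res = linprog(cost.ravel(), A_ub=A_ub, b_ub=b_ub,
                  A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)

    # Clip tiny negative solver noise before taking the p-th root.
    return (max(res.fun, 0.0) / (1.0 - eps)) ** (1.0 / p)


# Toy example: 10% of the first measure sits on a far-away outlier.
x, a = [0.0, 100.0], [0.9, 0.1]
y, b = [0.0], [1.0]
print(robust_wasserstein(x, y, a, b, eps=0.0))
print(robust_wasserstein(x, y, a, b, eps=0.1))
```

In this toy example, setting `eps=0.0` recovers the ordinary $\mathsf{W}_1$ distance, which is dominated by the single outlier (roughly 10), while `eps=0.1` lets the outlier mass be discarded and the distance drops to roughly 0. A dense LP like this scales poorly with the number of support points, so it is meant only to illustrate the definition.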