We study the problem of robust distribution estimation under the Wasserstein metric, a popular discrepancy measure between probability distributions rooted in optimal transport (OT) theory. We introduce a new outlier-robust Wasserstein distance $\mathsf{W}_p^\varepsilon$ which allows for $\varepsilon$ outlier mass to be removed from its input distributions, and show that minimum distance estimation under $\mathsf{W}_p^\varepsilon$ achieves minimax optimal robust estimation risk. Our analysis is rooted in several new results for partial OT, including an approximate triangle inequality, which may be of independent interest. To address computational tractability, we derive a dual formulation for $\mathsf{W}_p^\varepsilon$ that adds a simple penalty term to the classic Kantorovich dual objective. As such, $\mathsf{W}_p^\varepsilon$ can be implemented via an elementary modification to standard, duality-based OT solvers. Our results are extended to sliced OT, where distributions are projected onto low-dimensional subspaces, and applications to homogeneity and independence testing are explored. We illustrate the virtues of our framework via applications to generative modeling with contaminated datasets.
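To make the idea of removing $\varepsilon$ outlier mass concrete, the following is a minimal brute-force sketch for tiny one-dimensional samples. It is not the dual-based solver described above. It rests on several assumptions that are not taken from the abstract: both inputs are uniform empirical measures on $n$ points, $\varepsilon = k/n$ so that whole points are removed, and the trimmed measures are renormalized to probability measures before the standard $\mathsf{W}_p$ is computed. The function names `wasserstein_1d` and `robust_wasserstein_1d` are hypothetical.

```python
from itertools import combinations


def wasserstein_1d(xs, ys, p=1):
    """W_p between two equal-size 1D samples with uniform weights.

    In one dimension with cost |x - y|^p (p >= 1), matching the sorted
    samples in order is an optimal transport plan.
    """
    assert len(xs) == len(ys) and len(xs) > 0
    cost = sum(abs(x - y) ** p for x, y in zip(sorted(xs), sorted(ys)))
    return (cost / len(xs)) ** (1.0 / p)


def robust_wasserstein_1d(xs, ys, k, p=1):
    """Brute-force sketch of an outlier-robust distance with eps = k / n.

    Assumption: k whole points are dropped from each sample and the
    remaining points are reweighted uniformly (renormalization).
    The search is exponential in n, so this is only for illustration.
    """
    n = len(xs)
    assert len(ys) == n and 0 <= k < n
    keep = n - k
    return min(
        wasserstein_1d(sub_x, sub_y, p)
        for sub_x in combinations(xs, keep)
        for sub_y in combinations(ys, keep)
    )


clean = [0.0, 1.0, 2.0, 3.0]
contaminated = [0.0, 1.0, 2.0, 100.0]  # the point 3.0 is replaced by an outlier

plain = wasserstein_1d(clean, contaminated)                # dominated by the outlier
robust = robust_wasserstein_1d(clean, contaminated, k=1)   # eps = 1/4
print(plain, robust)
```

In this toy example the single outlier dominates the standard distance, while allowing one point to be dropped from each sample lets the remaining points match exactly. A practical implementation would instead modify a duality-based OT solver, as the abstract indicates.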