Distribution shifts are a serious concern in modern statistical learning as they can systematically change the properties of the data away from the truth. We focus on Wasserstein distribution shifts, where every data point may undergo a slight perturbation, as opposed to the Huber contamination model where a fraction of observations are outliers. We formulate and study shifts beyond independent perturbations, exploring Joint Distribution Shifts, where the per-observation perturbations can be coordinated. We analyze several important statistical problems, including location estimation, linear regression, and non-parametric density estimation. Under a squared loss for mean estimation and prediction error in linear regression, we find the exact minimax risk, a least favorable perturbation, and show that the sample mean and least squares estimators are respectively optimal. This holds for both independent and joint shifts, but the least favorable perturbations and minimax risks differ. For other problems, we provide nearly optimal estimators and precise finite-sample bounds. We also introduce several tools for bounding the minimax risk under distribution shift, such as a smoothing technique for location families, and generalizations of classical tools including least favorable sequences of priors, the modulus of continuity, Le Cam's, Fano's, and Assouad's methods.
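To make the contrast between the two perturbation models concrete, here is a minimal simulation sketch. It is not the paper's construction, and every modelling choice in it is an assumption made for illustration: Gaussian data, a constant shift of size `eps` applied to every observation (which moves the distribution a Wasserstein distance of exactly `eps`), and a Huber-style contamination that replaces a fraction `frac` of points with a fixed `outlier`. The constant shift is only one admissible perturbation and is not claimed to be the least favorable one. For it, the squared error of the sample mean is exactly sigma^2 / n + eps^2, which the Monte Carlo estimate should roughly reproduce.

```python
import random
import statistics


def shifted_sample(theta, sigma, n, eps, rng):
    """Draw n Gaussian points around theta, then move every point by eps.

    A constant shift is an illustrative assumption of this sketch, not the
    paper's least favorable perturbation.
    """
    return [rng.gauss(theta, sigma) + eps for _ in range(n)]


def contaminated_sample(theta, sigma, n, frac, outlier, rng):
    """Huber-style contamination: each point becomes `outlier` with probability frac."""
    return [outlier if rng.random() < frac else rng.gauss(theta, sigma)
            for _ in range(n)]


def mc_squared_error(sampler, theta, trials):
    """Monte Carlo estimate of the squared error of the sample mean."""
    errs = [(statistics.fmean(sampler()) - theta) ** 2 for _ in range(trials)]
    return statistics.fmean(errs)


rng = random.Random(0)  # fixed seed so the run is reproducible
theta, sigma, n, eps, trials = 0.0, 1.0, 50, 0.3, 4000

clean_mse = mc_squared_error(
    lambda: shifted_sample(theta, sigma, n, 0.0, rng), theta, trials)
shift_mse = mc_squared_error(
    lambda: shifted_sample(theta, sigma, n, eps, rng), theta, trials)
huber_mse = mc_squared_error(
    lambda: contaminated_sample(theta, sigma, n, 0.05, 10.0, rng), theta, trials)

# Exact squared error of the sample mean under a constant shift of size eps.
expected_shift_mse = sigma ** 2 / n + eps ** 2

print("clean:", clean_mse)
print("constant shift:", shift_mse, "theory:", expected_shift_mse)
print("Huber contamination:", huber_mse)
```

In this sketch the shift adds a bias term that does not shrink with `n`, whereas the contamination corrupts only some observations but by a large amount. Both effects depend entirely on the illustrative parameters chosen above.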