We study the effects of missingness on the estimation of population parameters. Moving beyond restrictive missing completely at random (MCAR) assumptions, we first formulate a missing data analogue of Huber's arbitrary $ε$-contamination model. For mean estimation with respect to squared Euclidean error loss, we show that the minimax quantiles decompose as the sum of two terms: the corresponding minimax quantiles under a heterogeneous MCAR assumption, and a robust error term, depending on $ε$, that reflects the additional error incurred by departure from MCAR. We next introduce natural classes of realisable $ε$-contamination models, where an MCAR version of a base distribution $P$ is contaminated by an arbitrary missing not at random (MNAR) version of $P$. These classes are rich enough to capture various notions of biased sampling and sensitivity conditions, yet we show that they enjoy improved minimax performance relative to our earlier arbitrary contamination classes, for both parametric and nonparametric classes of base distributions. For instance, with a univariate Gaussian base distribution, consistent mean estimation over realisable $ε$-contamination classes is possible even when $ε$ and the proportion of missingness converge (slowly) to 1. We extend our results to the setting of departures from missing at random (MAR) in normal linear regression with a realisable missing response, and also demonstrate that our methods can be made adaptive to the case of unknown $ε$.
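As a rough illustration of why departures from MCAR matter, the following sketch simulates one possible reading of a realisable $ε$-contamination mechanism with a standard Gaussian base distribution: each observation is revealed by an MCAR mechanism with probability $1-ε$ and by an MNAR mechanism with probability $ε$. The function names, the observation probability `q`, and the particular MNAR rule (reveal only positive values) are assumptions made for illustration; they are not taken from the paper, and the complete-case mean used here is a naive baseline, not the estimator the paper analyses.

```python
import numpy as np


def simulate_sample(n, eps, q, rng):
    """Draw n points from N(0, 1) and mask some of them as missing (NaN).

    Assumed mechanism (illustrative only):
      - with probability 1 - eps, a point is revealed MCAR with probability q;
      - with probability eps, a point is revealed only if it is positive,
        a hypothetical MNAR rule chosen to make the bias easy to see.
    """
    x = rng.standard_normal(n)
    contaminated = rng.random(n) < eps      # which points use the MNAR rule
    mcar_revealed = rng.random(n) < q       # revelation independent of x
    mnar_revealed = x > 0                   # revelation depends on x itself
    revealed = np.where(contaminated, mnar_revealed, mcar_revealed)
    return np.where(revealed, x, np.nan)


def complete_case_mean(sample):
    """Naive estimator: average the revealed values, ignoring missing ones."""
    return np.nanmean(sample)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for eps in (0.0, 0.1, 0.3):
        sample = simulate_sample(200_000, eps=eps, q=0.5, rng=rng)
        print(f"eps={eps:.1f}  complete-case mean={complete_case_mean(sample):.3f}")
```

Under pure MCAR (`eps=0`) the complete-case mean concentrates around the true mean of zero, while under this particular contamination it is pulled upward by an amount that grows with `eps`. This is the kind of additional, $ε$-dependent error that the robust term in the abstract quantifies in a minimax sense.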