We study computationally efficient estimation of population parameters when observations are subject to missing data. In particular, we consider the realizable contamination model of missing data, in which an $ε$ fraction of the observations are subject to an arbitrary (and unknown) missing not at random (MNAR) mechanism. When the underlying data are Gaussian, we provide evidence of statistical-computational gaps in several problems. For mean estimation in $\ell_2$ norm, we show that, for any constant contamination level $ε\in (0, 1)$, (roughly) $n \gtrsim d e^{1/ρ^2}$ samples are necessary to obtain error at most $ρ$, and that a computationally inefficient algorithm achieves this error. On the other hand, we show that any computationally efficient method within certain popular families of algorithms requires a much larger sample size of (roughly) $n \gtrsim d^{1/ρ^2}$, and that a polynomial-time algorithm based on sum-of-squares (nearly) achieves this lower bound. For covariance estimation in relative operator norm, we show that a parallel development holds. Finally, we turn to linear regression with missing observations and show that such a gap does not persist: in this setting, minimizing a simple, strongly convex empirical risk nearly achieves the information-theoretic lower bound in polynomial time.
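To make the contamination model concrete, the following is a minimal one-dimensional simulation sketch, not the paper's construction. It assumes each observation is independently flagged as contaminated with probability $ε$, and it uses one particular hypothetical MNAR adversary that hides a flagged value whenever it exceeds the true mean. The function name `sample_with_contamination` and this adversary are illustrative assumptions; the model in the text allows any mechanism on the contaminated fraction.

```python
import numpy as np


def sample_with_contamination(n, mu, eps, rng):
    """Draw n points from N(mu, 1) and hide some of them.

    Assumption (illustrative only): each point is flagged as contaminated
    with probability eps, and a hypothetical adversary hides a flagged
    point whenever its value exceeds mu. Unflagged points are always seen.
    """
    full = rng.normal(loc=mu, scale=1.0, size=n)
    contaminated = rng.random(n) < eps
    hidden = contaminated & (full > mu)
    # Missing entries are encoded as NaN.
    observed = np.where(hidden, np.nan, full)
    return full, observed


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    full, observed = sample_with_contamination(200_000, mu=0.0, eps=0.2, rng=rng)
    print("mean of complete data:   ", full.mean())
    print("mean of observed entries:", np.nanmean(observed))
    print("fraction missing:        ", np.isnan(observed).mean())
```

Under this particular adversary, averaging only the observed entries is biased below the true mean, because the missingness depends on the hidden values. This illustrates why a naive complete-case estimator cannot be trusted in the MNAR setting, although it says nothing about the sample-complexity bounds themselves.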