Missing values in datasets are common in applied statistics. For regression problems, theoretical work thus far has largely considered the issue of missing covariates as distinct from missing responses. However, in practice, many datasets have both forms of missingness. Motivated by this gap, we study linear regression with a labelled dataset containing missing covariates, potentially alongside an unlabelled dataset. We consider both structured (blockwise-missing) and unstructured missingness patterns, along with sparse and non-sparse regression parameters. For the non-sparse case, we provide an estimator based on imputing the missing data combined with a reweighting step. For the high-dimensional sparse case, we use a modified version of the Dantzig selector. We provide non-asymptotic upper bounds on the risk of both procedures. These are matched by several new minimax lower bounds, demonstrating the rate optimality of our estimators. Notably, even when the linear model is well-specified, our results characterise substantial differences in the minimax rates when unlabelled data is present relative to the fully supervised setting. Particular consequences of our sparse and non-sparse results include the first matching upper and lower bounds on the minimax rate for the supervised setting when either unstructured or structured missingness is present. Our theory is coupled with extensive simulations and a semi-synthetic application to the California housing dataset.
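To make the data setting concrete, the following is a minimal simulation sketch of a labelled sample with missing covariates alongside an unlabelled sample. It is not the imputation-and-reweighting estimator or the modified Dantzig selector studied in this work. Instead it uses a simple available-case plug-in estimator, and it assumes mean-zero covariates and entries missing completely at random. The function names `simulate` and `available_case_fit`, the noise level, and the observation probability `p_obs` are all hypothetical choices made for illustration.

```python
import numpy as np


def simulate(n_lab, n_unlab, beta, p_obs, rng):
    """Draw labelled and unlabelled samples, then mask covariates at random.

    Assumptions (illustrative only): standard Gaussian covariates, Gaussian
    noise with standard deviation 0.5, and each covariate entry observed
    independently with probability p_obs (missing completely at random).
    """
    d = beta.shape[0]
    X_lab = rng.standard_normal((n_lab, d))
    y = X_lab @ beta + 0.5 * rng.standard_normal(n_lab)
    X_unlab = rng.standard_normal((n_unlab, d))
    # Missing entries are encoded as NaN.
    X_lab = np.where(rng.random((n_lab, d)) < p_obs, X_lab, np.nan)
    X_unlab = np.where(rng.random((n_unlab, d)) < p_obs, X_unlab, np.nan)
    return X_lab, y, X_unlab


def available_case_fit(X_lab, y, X_unlab):
    """Plug-in estimate solving Sigma_hat @ beta = gamma_hat.

    Sigma_hat estimates E[X X^T] entrywise from every row (labelled or
    unlabelled) in which both coordinates are observed; gamma_hat estimates
    E[X y] coordinatewise from labelled rows in which that coordinate is
    observed. Assumes mean-zero covariates, so no centring is applied.
    """
    X_all = np.vstack([X_lab, X_unlab])
    obs_all = ~np.isnan(X_all)
    Z_all = np.where(obs_all, X_all, 0.0)
    # pair_counts[j, k] = number of rows with both coordinates j and k observed.
    pair_counts = obs_all.astype(float).T @ obs_all.astype(float)
    Sigma_hat = (Z_all.T @ Z_all) / pair_counts

    obs_lab = ~np.isnan(X_lab)
    Z_lab = np.where(obs_lab, X_lab, 0.0)
    gamma_hat = (Z_lab.T @ y) / obs_lab.sum(axis=0)
    return np.linalg.solve(Sigma_hat, gamma_hat)


rng = np.random.default_rng(0)
beta = np.array([1.0, -2.0, 0.5])
X_lab, y, X_unlab = simulate(5000, 20000, beta, 0.7, rng)
beta_hat = available_case_fit(X_lab, y, X_unlab)
print("estimate:", beta_hat)
```

In this sketch the unlabelled rows enter only through the second-moment estimate `Sigma_hat`, which illustrates one way in which unlabelled data can carry information about the regression parameter when covariates are missing.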