Missing values have been thoroughly analyzed in the context of linear models, where the final aim is to build coefficient estimates. However, estimating coefficients does not directly solve the problem of prediction with missing entries: a way to handle the empty components must also be designed. The main approaches to prediction with missing values are empirically driven and fall into two families: imputation (filling in empty fields) and pattern-by-pattern prediction, where a separate predictor is built for each missingness pattern. Unfortunately, most simple imputation techniques used in practice (such as constant imputation) are not consistent when combined with linear models. In this paper, we focus on the more flexible pattern-by-pattern approaches and study their predictive performance on Missing Completely At Random (MCAR) data. We first show that a pattern-by-pattern logistic regression model is intrinsically ill-defined, implying that even classical logistic regression is impossible to apply to missing data. We then analyze the perceptron model and show how the linear separability property extends to partially observed inputs. Finally, we use Linear Discriminant Analysis (LDA) to prove that pattern-by-pattern LDA is consistent in a high-dimensional regime. We then extend our analysis to more complex Missing Not At Random (MNAR) data.
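To make the pattern-by-pattern idea concrete, the following is a minimal sketch, not the procedure analyzed in the paper. It fits one two-class LDA per missingness pattern, using only the coordinates observed under that pattern. The function names (`fit_lda`, `fit_pattern_by_pattern`, `predict_pattern_by_pattern`), the small ridge term, the fallback label for unseen patterns, and the synthetic Gaussian MCAR data are all assumptions made for illustration.

```python
import numpy as np

def fit_lda(X, y):
    """Fit a two-class LDA with pooled covariance; return (w, b)."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled within-class covariance; the small ridge is an assumption
    # added here only for numerical stability.
    centered = np.vstack([X0 - mu0, X1 - mu1])
    cov = centered.T @ centered / max(len(X) - 2, 1)
    cov += 1e-6 * np.eye(X.shape[1])
    w = np.linalg.solve(cov, mu1 - mu0)
    b = -0.5 * w @ (mu0 + mu1) + np.log(len(X1) / len(X0))
    return w, b

def fit_pattern_by_pattern(X, y):
    """Fit one LDA per missingness pattern (NaN marks a missing entry)."""
    mask = np.isnan(X)
    models = {}
    for pattern in {tuple(m.tolist()) for m in mask}:
        missing = np.array(pattern)
        rows = (mask == missing).all(axis=1)
        observed = ~missing
        # Skip patterns with nothing observed or with only one class present.
        if observed.sum() == 0 or len(np.unique(y[rows])) < 2:
            continue
        models[pattern] = fit_lda(X[rows][:, observed], y[rows])
    return models

def predict_pattern_by_pattern(models, X, default=0):
    """Predict each row with the model fitted on its own missingness pattern."""
    mask = np.isnan(X)
    preds = np.full(len(X), default)
    for i, m in enumerate(mask):
        model = models.get(tuple(m.tolist()))
        if model is not None:  # patterns without a model keep the default label
            w, b = model
            preds[i] = int(X[i, ~m] @ w + b > 0)
    return preds

# Synthetic illustration: two Gaussian classes, entries removed completely at random.
rng = np.random.default_rng(0)
n, d = 2000, 3
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d)) + 2.0 * y[:, None]  # class 1 is shifted by 2 in each coordinate
X[rng.random((n, d)) < 0.3] = np.nan            # MCAR mask, independent of X and y
models = fit_pattern_by_pattern(X, y)
accuracy = (predict_pattern_by_pattern(models, X) == y).mean()
```

With d features there are up to 2^d patterns, so each per-pattern model sees only a fraction of the training data; this is the cost that makes the high-dimensional analysis non-trivial.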