Supervised learning with missing data aims at building the best possible prediction of a target output from partially observed inputs. Major approaches to this problem fall into two categories: $(i)$ impute-then-predict strategies, which first fill in the missing input components and then apply a single predictor, and $(ii)$ Pattern-by-Pattern (P-b-P) approaches, which build one predictor per missing-data pattern. In this paper, we theoretically analyze how three classical linear classifiers, namely the perceptron, logistic regression, and linear discriminant analysis (LDA), behave on Missing Completely At Random (MCAR) data, depending on the strategy (imputation or P-b-P) used to handle missing values. We prove that both imputation and P-b-P approaches are ill-specified in a logistic regression framework, thus questioning the relevance of such approaches for handling missing data. The most favorable setting for classification with missing data is that of P-b-P LDA methods, for which we provide finite-sample bounds on the excess risk, even in high-dimensional or MNAR settings. Experiments illustrate our theoretical findings.