Predicting with missing inputs challenges even parametric models, as parameter estimation alone is insufficient for prediction on incomplete data. While several works study prediction in linear models, we focus on logistic models, where optimal predictors lack closed-form expressions. We prove that a Pattern-by-Pattern strategy (PbP), which learns one logistic model per missingness pattern, accurately approximates Bayes probabilities under a Gaussian Pattern Mixture Model (GPMM). Crucially, this result holds not only in the standard missing data scenarios (MCAR and MAR) but also in Missing Not at Random (MNAR) settings, where standard methods often fail. Empirically, we compare PbP against imputation and EM methods on classification, probability estimation, calibration, and inference. Our analysis provides a comprehensive view of logistic regression with missing values. It reveals that mean imputation is a suitable baseline for small sample sizes and PbP for large ones, as both methods are fast to train and can perform well in some settings. The best performance is achieved by non-linear multiple iterative imputation techniques that include the response label (Random Forest MICE with response), at a higher computational cost.
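To make the PbP strategy concrete, the following is a minimal sketch, assuming NumPy and scikit-learn; the function names (`fit_pbp`, `predict_proba_pbp`) are illustrative, not from the paper's code. It fits one logistic regression per missingness pattern, using only the coordinates observed under that pattern.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_pbp(X, y):
    """Fit one logistic regression per missingness pattern of X (NaN = missing)."""
    patterns = np.isnan(X)                      # True where a feature is missing
    models = {}
    for pat in np.unique(patterns, axis=0):     # iterate over observed patterns
        rows = np.all(patterns == pat, axis=1)  # samples sharing this pattern
        obs = ~pat                              # observed coordinates
        # Caveat: assumes each pattern has both classes and at least one
        # observed feature; a robust implementation would handle these cases.
        models[tuple(pat)] = LogisticRegression().fit(X[rows][:, obs], y[rows])
    return models

def predict_proba_pbp(models, X):
    """Predict P(Y=1 | observed features) with the pattern-specific models."""
    proba = np.full(len(X), np.nan)
    patterns = np.isnan(X)
    for i, pat in enumerate(patterns):
        model = models.get(tuple(pat))          # pattern seen at training time
        if model is not None:
            obs = ~pat
            proba[i] = model.predict_proba(X[i, obs].reshape(1, -1))[0, 1]
    return proba
```

This sketch highlights the trade-off discussed above: training is fast (one cheap fit per pattern), but patterns unseen at training time receive no prediction, which is why PbP is suited to large sample sizes where each pattern has enough data.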