Missing values are unavoidable in many applications of machine learning and present challenges both during training and at test time. When variables are missing in recurring patterns, fitting separate pattern submodels has been proposed as a solution. However, fitting models independently does not make efficient use of all available data. Conversely, fitting a single shared model to the full data set relies on imputation, which often leads to biased results when missingness depends on unobserved factors. We propose an alternative approach, called sharing pattern submodels, which i) makes predictions that are robust to missing values at test time, ii) maintains or improves the predictive power of pattern submodels, and iii) has a short description, enabling improved interpretability. Parameter sharing is enforced through sparsity-inducing regularization, which we prove leads to consistent estimation. Finally, we give conditions for when a sharing model is optimal, even when both missingness and the target outcome depend on unobserved variables. Classification and regression experiments on synthetic and real-world data sets demonstrate that our models achieve a favorable tradeoff between pattern specialization and information sharing.
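As a rough illustration of how pattern submodels might share parameters through a sparsity-inducing penalty, the sketch below fits a linear regression in which each missingness pattern uses a shared coefficient vector plus a pattern-specific deviation, with an L1 penalty pushing the deviations toward zero. This is not the authors' implementation. The squared loss, the absence of an intercept, the additive "shared plus deviation" parameterization, the proximal gradient solver, the fallback to shared coefficients for unseen patterns, and the function names `fit_sharing_submodels` and `predict_sharing_submodels` are all assumptions made for illustration.

```python
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of t * ||v||_1 (elementwise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def fit_sharing_submodels(X, y, lam=0.1, n_iter=2000):
    """Hypothetical sketch: minimize
        (1 / 2n) * sum_i (y_i - x_i^T (theta + delta[m(i)]))^2
        + lam * sum_m ||delta[m]||_1
    over a shared vector theta and per-pattern deviations delta,
    where m(i) is the missingness pattern of row i and missing
    entries of X (NaN) contribute nothing to the prediction.
    """
    n, d = X.shape
    mask = ~np.isnan(X)                      # True where a value is observed
    Xz = np.where(mask, X, 0.0)              # missing entries set to zero
    patterns, pid = np.unique(mask.astype(np.int8), axis=0, return_inverse=True)
    pid = pid.ravel()                        # pattern index of each row
    theta = np.zeros(d)
    delta = np.zeros((len(patterns), d))
    # Upper bound on the Lipschitz constant of the smooth part's gradient.
    lipschitz = max(2.0 * np.linalg.norm(Xz, 2) ** 2 / n, 1e-12)
    step = 1.0 / lipschitz
    for _ in range(n_iter):
        coef = theta + delta[pid]            # row-wise coefficients, shape (n, d)
        resid = np.sum(Xz * coef, axis=1) - y
        G = Xz * resid[:, None] / n          # per-row gradient contributions
        grad_delta = np.zeros_like(delta)
        np.add.at(grad_delta, pid, G)        # accumulate rows by pattern
        theta = theta - step * G.sum(axis=0)
        # Only the deviations are penalized, so only they are shrunk.
        # Deviations for features missing in a pattern have zero gradient
        # and therefore remain exactly zero.
        delta = soft_threshold(delta - step * grad_delta, step * lam)
    return theta, delta, patterns


def predict_sharing_submodels(X, theta, delta, patterns):
    """Predict using the submodel that matches each row's missingness pattern.
    Rows with a pattern unseen in training fall back to theta alone
    (an assumption of this sketch)."""
    mask = ~np.isnan(X)
    Xz = np.where(mask, X, 0.0)
    preds = np.empty(len(X))
    for i in range(len(X)):
        match = np.where((patterns == mask[i]).all(axis=1))[0]
        coef = theta + (delta[match[0]] if len(match) else 0.0)
        preds[i] = Xz[i] @ coef
    return preds
```

In this sketch, a large `lam` drives every deviation to zero, so all patterns use the same coefficients, while `lam = 0` lets each pattern adjust its coefficients freely. Intermediate values give a compromise between specialization and sharing, and the nonzero entries of `delta` indicate where a pattern departs from the shared model.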