When training predictive models on data with missing entries, the most widely used and versatile approach is an impute-then-regress pipeline: we first impute the missing entries and then compute predictions from the completed data. In this paper, we view prediction with missing data as a two-stage adaptive optimization problem and propose a new class of models, adaptive linear regression models, in which the regression coefficients adapt to the set of observed features. We show that some adaptive linear regression models are equivalent to learning an imputation rule and a downstream linear regression model simultaneously rather than sequentially. We leverage this joint-impute-then-regress interpretation to generalize our framework to non-linear models. In settings where the data are strongly not missing at random, our methods achieve a 2-10% improvement in out-of-sample accuracy.
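To make the idea of coefficients that "adapt to the set of observed features" concrete, here is a minimal sketch of one possible adaptive linear model. It is not the authors' implementation. It assumes, purely for illustration, that each coefficient depends affinely on the binary missingness indicator, so the model reduces to ordinary least squares on an expanded feature set. The function names, the zero-filling of missing values, and the small ridge term are all hypothetical choices made for this example.

```python
import numpy as np

def adaptive_features(X):
    """Expand X (NaN = missing) into features for an affinely adaptive linear model.

    Assumed model (illustrative only):
        y ~ b + sum_j (w_j + sum_k W_jk * m_k) * z_j
    where m_k = 1 if feature k is missing and z_j is x_j with missing values set to 0.
    """
    M = np.isnan(X).astype(float)        # missingness indicators m
    Z = np.where(np.isnan(X), 0.0, X)    # observed values, 0 where missing
    n, d = X.shape
    # Interaction z_j * m_k: the weight on feature j shifts when feature k is missing.
    inter = (Z[:, :, None] * M[:, None, :]).reshape(n, d * d)
    return np.hstack([np.ones((n, 1)), Z, inter])

def fit_adaptive_linear(X, y, ridge=1e-6):
    """Least-squares fit; the tiny ridge term handles the all-zero columns z_j * m_j."""
    F = adaptive_features(X)
    A = F.T @ F + ridge * np.eye(F.shape[1])
    return np.linalg.solve(A, F.T @ y)

def predict_adaptive_linear(theta, X):
    """Predict with coefficients that depend on each row's missingness pattern."""
    return adaptive_features(X) @ theta
```

Calling `fit_adaptive_linear(X, y)` on a matrix `X` that contains `NaN` entries returns a parameter vector that `predict_adaptive_linear` can apply to new rows with any missingness pattern. Because the intercept, the coefficients, and the dependence on the missingness pattern are fitted together, no separate imputation step is run first, which is the contrast with the sequential pipeline drawn in the abstract.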