When estimating a regression model, some labels may be missing, or the data may be biased by a selection mechanism. When the response or selection mechanism is ignorable (i.e., independent of the response variable given the features), one can use off-the-shelf regression methods; in the nonignorable case, one typically has to adjust for bias. We observe that privileged data (i.e., data that is only available during training) might render a nonignorable selection mechanism ignorable, and we refer to this scenario as Privilegedly Missing at Random (PMAR). We propose a novel imputation-based regression method, named repeated regression, that is suitable for PMAR. We also consider an importance-weighted regression method and a doubly robust combination of the two. The proposed methods are easy to implement with most popular out-of-the-box regression algorithms. We empirically assess the performance of the proposed methods with extensive simulated experiments and on a synthetically augmented real-world dataset. We conclude that repeated regression can appropriately correct for bias and can have a considerable advantage over weighted regression, especially when extrapolating to regions of the feature space where the response is never observed.
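To make the PMAR setting concrete, the following is a minimal simulated sketch and not the authors' implementation. The abstract does not spell out the steps of repeated regression, so the two-stage procedure below is an assumed reading of "imputation-based": first regress the response on both the features and the privileged variable using the selected rows, then regress the imputed responses on the features alone using all rows. The data-generating process, the variable names (`x`, `z`, `y`, `s`), and the helper `ols` are all hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical data: x is the regular feature, z is privileged
# (available only at training time), y is the response.
x = rng.uniform(-2.0, 2.0, size=n)
z = rng.normal(size=n)
y = 1.0 + 2.0 * x + 3.0 * z + rng.normal(size=n)  # so E[y | x] = 1 + 2x

# Selection depends on z only. Given (x, z) it is independent of y
# (ignorable), but given x alone it is not, because z also drives y.
p_select = 1.0 / (1.0 + np.exp(-2.0 * z))
s = rng.random(n) < p_select  # True where the label is observed


def ols(features, target):
    """Ordinary least squares with an intercept; returns [intercept, slopes...]."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


# Naive approach: regress y on x using only the selected rows.
naive_coef = ols(x[s], y[s])

# Assumed repeated-regression sketch.
# Stage 1: regress y on (x, z) using the selected rows.
stage1 = ols(np.column_stack([x[s], z[s]]), y[s])
# Impute the response for every row, selected or not.
y_hat = stage1[0] + stage1[1] * x + stage1[2] * z
# Stage 2: regress the imputed responses on x alone, using all rows.
repeated_coef = ols(x, y_hat)

print("true     [intercept, slope]:", [1.0, 2.0])
print("naive    [intercept, slope]:", naive_coef)
print("repeated [intercept, slope]:", repeated_coef)
```

In this toy setup the naive fit has a biased intercept because rows with large `z` are over-represented among the labeled data, whereas the two-stage fit uses `z` only during training and still returns a model that takes `x` alone as input. Any regression algorithm could stand in for `ols` in either stage; linear least squares is used here only to keep the sketch dependency-free.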