When estimating a regression model, some labels may be missing, or the data may be biased by a selection mechanism. When the response or selection mechanism is ignorable (i.e., independent of the response variable given the features), one can use off-the-shelf regression methods; in the nonignorable case, one typically has to adjust for bias. We observe that privileged information (i.e., information that is only available during training) might render a nonignorable selection mechanism ignorable, and we refer to this scenario as Privilegedly Missing at Random (PMAR). We propose a novel imputation-based regression method, named repeated regression, that is suitable for PMAR. We also consider an importance-weighted regression method and a doubly robust combination of the two. The proposed methods are easy to implement with most popular out-of-the-box regression algorithms. We empirically assess the performance of the proposed methods with extensive simulated experiments and on a synthetically augmented real-world dataset. We conclude that repeated regression can appropriately correct for bias and can have a considerable advantage over weighted regression, especially when extrapolating to regions of the feature space where the response is never observed.
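To make the PMAR setting concrete, the sketch below simulates a toy problem and compares three fits. It is only one plausible reading of the abstract, not the authors' implementation: repeated regression is taken to mean fitting the response on features plus privileged information using the selected rows, imputing the response for all rows, and then regressing the imputed response on the features alone. The data-generating process, the linear models, the helper names `fit_linear` and `predict_linear`, and the use of the true selection probability as the importance weight are all assumptions made for illustration.

```python
import numpy as np


def fit_linear(features, target, weights=None):
    """Least-squares fit with intercept; returns [intercept, slope(s)]."""
    design = np.column_stack([np.ones(len(target)), features])
    if weights is not None:
        # Weighted least squares via row scaling by sqrt(weight).
        root_w = np.sqrt(weights)
        design, target = design * root_w[:, None], target * root_w
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


def predict_linear(coef, features):
    design = np.column_stack([np.ones(features.shape[0]), features])
    return design @ coef


# Hypothetical toy data: x is the feature, z is privileged information.
rng = np.random.default_rng(0)
n = 20_000
x = rng.uniform(-2.0, 2.0, n)
z = x + rng.normal(0.0, 1.0, n)
y = 1.0 + 2.0 * z + rng.normal(0.0, 0.5, n)  # implies E[y | x] = 1 + 2x

# Selection depends on z only: ignorable given (x, z), not given x alone.
p_select = 1.0 / (1.0 + np.exp(-z))
s = rng.uniform(size=n) < p_select

# Naive fit: regress y on x using selected rows only (expected to be biased).
beta_naive = fit_linear(x[s], y[s])

# Repeated regression (as read from the abstract):
# 1) fit y ~ (x, z) on selected rows, 2) impute y for all rows,
# 3) fit the imputed response ~ x on all rows.
xz = np.column_stack([x, z])
stage_one = fit_linear(xz[s], y[s])
y_imputed = predict_linear(stage_one, xz)
beta_repeated = fit_linear(x, y_imputed)

# Importance weighting: weight selected rows by 1 / P(selected | x, z).
# The true probability is used here; in practice it would be estimated.
beta_weighted = fit_linear(x[s], y[s], weights=1.0 / p_select[s])

print("target   :", np.array([1.0, 2.0]))
print("naive    :", beta_naive)
print("repeated :", beta_repeated)
print("weighted :", beta_weighted)
```

In this toy setup the naive fit should overestimate the intercept, because rows with large `z` (and hence large `y`) are more likely to be selected, while the repeated and weighted fits should land near the target coefficients. The imputation route does not divide by small selection probabilities, which is consistent with the abstract's remark about extrapolating to regions where the response is never observed.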