Rates of missing data often depend on record-keeping policies and thus may change across times and locations, even when the underlying features are comparatively stable. In this paper, we introduce the problem of Domain Adaptation under Missingness Shift (DAMS). Here, (labeled) source data and (unlabeled) target data would be exchangeable but for different missing data mechanisms. We show that if missing data indicators are available, DAMS reduces to covariate shift. Addressing cases where such indicators are absent, we establish the following theoretical results for underreporting completely at random: (i) covariate shift is violated (adaptation is required); (ii) the optimal linear source predictor can perform arbitrarily worse on the target domain than always predicting the mean; (iii) the optimal target predictor can be identified, even when the missingness rates themselves are not; and (iv) for linear models, a simple analytic adjustment yields consistent estimates of the optimal target parameters. In experiments on synthetic and semi-synthetic data, we demonstrate the promise of our methods when assumptions hold. Finally, we discuss a rich family of future extensions.
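To make results (i) and (ii) concrete, the following is a minimal simulation sketch, not the paper's method or experiments. It assumes that underreported entries are recorded as zero with no indicator, that there are two highly correlated Gaussian features, and that the source and target missingness rates are as chosen below; all function names and constants are hypothetical. A linear model fit on the source domain is compared, on target data, against an oracle fit that uses target labels (which would not be available in the DAMS setting) and against always predicting the mean.

```python
import numpy as np


def make_domain(n, miss_rates, rng):
    """Draw observed features and labels for one domain.

    Assumption for illustration: each feature is independently
    underreported with its own rate, and underreported entries are
    stored as 0.0 with no missingness indicator.
    """
    z = rng.normal(1.0, 1.0, size=n)
    # Two highly correlated features that share the latent value z.
    x = np.column_stack([z, z + 0.1 * rng.normal(size=n)])
    y = x.sum(axis=1) + 0.1 * rng.normal(size=n)
    missing = rng.random(x.shape) < np.asarray(miss_rates)
    x_obs = np.where(missing, 0.0, x)
    return x_obs, y


def fit_ols(x, y):
    """Ordinary least squares with an intercept term."""
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def mse(coef, x, y):
    """Mean squared error of a fitted linear model on (x, y)."""
    design = np.column_stack([np.ones(len(x)), x])
    return float(np.mean((design @ coef - y) ** 2))


rng = np.random.default_rng(0)
# Hypothetical rates: the source underreports feature 2, the target feature 1.
x_src, y_src = make_domain(20_000, [0.0, 0.8], rng)
x_tgt, y_tgt = make_domain(20_000, [0.8, 0.0], rng)

coef_src = fit_ols(x_src, y_src)
coef_tgt = fit_ols(x_tgt, y_tgt)  # oracle: uses target labels

mse_src_on_tgt = mse(coef_src, x_tgt, y_tgt)
mse_tgt_on_tgt = mse(coef_tgt, x_tgt, y_tgt)
mse_mean_baseline = float(np.var(y_tgt))  # always predict the target mean

print("source coefficients:", coef_src)
print("target coefficients:", coef_tgt)
print("target MSE, source model:", mse_src_on_tgt)
print("target MSE, oracle target model:", mse_tgt_on_tgt)
print("target MSE, mean baseline:", mse_mean_baseline)
```

Because the true features and labels are drawn from the same distribution in both domains, any gap between the two fitted coefficient vectors in this sketch comes only from the difference in missingness rates.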