We propose a simple, statistically principled, and theoretically justified method to improve supervised learning when the training set is not representative, a situation known as covariate shift. We build upon a well-established methodology in causal inference, and show that the effects of covariate shift can be reduced or eliminated by conditioning on propensity scores. In practice, this is achieved by fitting learners within strata constructed by partitioning the data based on the estimated propensity scores, leading to approximately balanced covariates and much-improved target prediction. We demonstrate the effectiveness of our general-purpose method on two contemporary research questions in cosmology, outperforming state-of-the-art importance weighting methods. We obtain the best reported AUC (0.958) on the updated "Supernovae photometric classification challenge", and we improve upon existing conditional density estimation of galaxy redshift from Sloan Digital Sky Survey (SDSS) data.
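The stratification step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and everything beyond "estimate propensity scores, partition into strata, fit a learner per stratum" is an assumption made for illustration: a single covariate, a logistic model for the propensity score e(x) = P(labeled | x), five equal-count strata, a least-squares line as the per-stratum learner, and synthetic Gaussian data. All function names (`fit_propensity`, `quantile_edges`, `stratum_of`, `fit_line`, `stratified_predict`) are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_propensity(xs, labeled, lr=0.5, epochs=300):
    """Logistic regression for e(x) = P(labeled = 1 | x), by gradient ascent."""
    a = b = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for x, s in zip(xs, labeled):
            resid = s - sigmoid(a + b * x)
            grad_a += resid
            grad_b += resid * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b

def quantile_edges(scores, n_strata):
    """Interior cut points that split the pooled scores into equal-count strata."""
    ordered = sorted(scores)
    return [ordered[(k * len(ordered)) // n_strata] for k in range(1, n_strata)]

def stratum_of(score, edges):
    """Index of the stratum containing `score` (0 .. len(edges))."""
    return sum(score >= edge for edge in edges)

def fit_line(xs, ys):
    """Least-squares line y = c + d*x; a stand-in for any supervised learner."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    d = sxy / sxx if sxx > 0 else 0.0
    return mean_y - d * mean_x, d

def stratified_predict(x_src, y_src, x_tgt, n_strata=5):
    """Fit one learner per propensity-score stratum and predict target labels."""
    pooled = x_src + x_tgt
    labeled = [1] * len(x_src) + [0] * len(x_tgt)
    a, b = fit_propensity(pooled, labeled)
    edges = quantile_edges([sigmoid(a + b * x) for x in pooled], n_strata)
    # Assumption: fall back to a global fit when a stratum has too few labeled points.
    fallback = fit_line(x_src, y_src)
    models = []
    for k in range(n_strata):
        pts = [(x, y) for x, y in zip(x_src, y_src)
               if stratum_of(sigmoid(a + b * x), edges) == k]
        models.append(fit_line(*zip(*pts)) if len(pts) >= 2 else fallback)
    preds = []
    for x in x_tgt:
        c, d = models[stratum_of(sigmoid(a + b * x), edges)]
        preds.append(c + d * x)
    return preds, (a, b), edges

# Synthetic covariate shift (assumed for illustration): labeled source data are
# centered at 0, unlabeled target data at 1; the response is y = x^2 + noise.
rng = random.Random(0)
x_src = [rng.gauss(0.0, 1.0) for _ in range(300)]
y_src = [x * x + rng.gauss(0.0, 0.1) for x in x_src]
x_tgt = [rng.gauss(1.0, 1.0) for _ in range(300)]
preds, (a, b), edges = stratified_predict(x_src, y_src, x_tgt)
```

Within each stratum the propensity score is roughly constant, so the labeled and unlabeled covariates falling in that stratum are approximately balanced, and a learner fit on the labeled part can be applied to the unlabeled part. The choice of five strata and the global-fit fallback are design choices of this sketch only.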