Transporting Predictions via Double Machine Learning: Predicting Partially Unobserved Students' Outcomes

Educational policymakers often lack data on student outcomes in regions where standardized tests were not administered. Machine learning models trained on data from a source population can be used to predict these unobserved outcomes in a target population. However, differences between the source and target populations, particularly in covariate distributions, can limit the transportability of such models, reducing predictive accuracy and introducing bias. We propose using double machine learning to fit a covariate-shift weighted model. First, we estimate the overlap score, namely the probability that an observation belongs to the source dataset given its covariates. Second, balancing weights, defined as the density ratio of target-to-source membership probabilities, are used to reweight each observation's contribution to the loss or likelihood function of the outcome prediction model for the target population. This approach downweights source observations that are less similar to the target population, so that predictions rely more heavily on observations with greater overlap. As a result, predictions generalize better under covariate shift. We illustrate this framework in the context of uncertain data on students' standardized financial literacy scores (FLS), using Bayesian Additive Regression Trees (BART) to predict missing FLS. We find minimal differences in predictive performance between the weighted and unweighted models, suggesting limited covariate shift in our empirical setting. Nonetheless, the proposed approach provides a principled framework for addressing covariate shift and is broadly applicable to predictive modeling in the social and health sciences, where differences between source and target populations are common.
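
The two steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes scikit-learn, uses logistic regression for the membership model, swaps in a gradient boosting regressor for BART, and omits any sample splitting or cross-fitting the authors may use. The synthetic data, the clipping bounds, and the function names `covariate_shift_weights` and `fit_weighted_outcome_model` are hypothetical and chosen only for illustration. The weight for a source observation with estimated overlap score p is taken to be (1 - p) / p.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression


def covariate_shift_weights(X_source, X_target):
    """Estimate balancing weights for source rows from a membership model."""
    X_all = np.vstack([X_source, X_target])
    # s = 1 marks source membership, s = 0 marks target membership.
    s = np.concatenate([np.ones(len(X_source)), np.zeros(len(X_target))])
    membership = LogisticRegression(max_iter=1000).fit(X_all, s)
    # Overlap score: estimated P(source | x) for each source row.
    p_source = membership.predict_proba(X_source)[:, 1]
    # Assumed clipping bounds, used only to avoid extreme weights.
    p_source = np.clip(p_source, 0.01, 0.99)
    weights = (1.0 - p_source) / p_source
    return weights / weights.mean()  # rescale so the weights average to 1


def fit_weighted_outcome_model(X_source, y_source, weights):
    """Fit an outcome model whose loss is reweighted by the balancing weights."""
    # Gradient boosting is a stand-in here; the abstract reports using BART.
    model = GradientBoostingRegressor(random_state=0)
    model.fit(X_source, y_source, sample_weight=weights)
    return model


# Synthetic data: target covariates are shifted relative to the source.
rng = np.random.default_rng(0)
X_source = rng.normal(0.0, 1.0, size=(500, 2))
X_target = rng.normal(0.7, 1.0, size=(300, 2))
y_source = X_source[:, 0] + 0.5 * X_source[:, 1] ** 2 + rng.normal(0, 0.1, 500)

weights = covariate_shift_weights(X_source, X_target)
model = fit_weighted_outcome_model(X_source, y_source, weights)
predictions = model.predict(X_target)  # predicted outcomes for target rows
```

In this sketch, source rows that resemble the target sample receive weights above 1 and dissimilar rows receive weights below 1. When the two covariate distributions are close, the weights stay near 1 and the weighted and unweighted fits nearly coincide, which is consistent with the small differences the abstract reports.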

