We study model selection in causal inference, specifically for conditional average treatment effect (CATE) estimation under binary treatments. Unlike model selection in machine learning, there is no perfect analogue of cross-validation, because we do not observe the counterfactual potential outcome for any data point. To address this, a variety of proxy metrics have been proposed in the literature; they depend on auxiliary nuisance models estimated from the observed data (propensity score model, outcome regression model). However, the effectiveness of these metrics has only been studied on synthetic datasets, where the counterfactual data are accessible. We conduct an extensive empirical analysis to judge the performance of these metrics and of novel ones introduced in this work, using the latest advances in generative modeling to incorporate multiple realistic datasets. Our analysis suggests novel model selection strategies based on careful hyperparameter tuning of CATE estimators and causal ensembling.