The performance of most causal effect estimators relies on accurate predictions of high-dimensional non-linear functions of the observed data. The flexibility of modern machine learning (ML) methods is well suited to this task. However, data-driven hyperparameter tuning of ML methods requires effective model evaluation to avoid large errors in causal estimates, a task made more challenging because causal inference involves unobservable counterfactuals. Multiple performance-validation metrics have recently been proposed, so practitioners now have to make complex decisions not only about which causal estimators, ML learners and hyperparameters to choose, but also about which evaluation metric to use. Motivated by the lack of clear recommendations, this paper investigates the interplay between these four aspects of model evaluation for causal effect estimation: causal estimators, ML learners, hyperparameters and evaluation metrics. We develop a comprehensive experimental setup that involves many commonly used causal estimators, ML methods and evaluation approaches, and apply it to four well-known causal inference benchmark datasets. Our results suggest that optimal hyperparameter tuning of ML learners is enough to reach state-of-the-art performance in effect estimation, regardless of estimators and learners. We conclude that most causal estimators are roughly equivalent in performance if tuned thoroughly enough. We also find that hyperparameter tuning and model evaluation are much more important than the choice of causal estimators and ML methods. Finally, given the significant gap we find between the estimation performance achieved with popular evaluation metrics and that of optimal model selection choices, we call for more research into causal model evaluation to unlock the optimum performance not currently being delivered even by state-of-the-art procedures.