Personalized treatment effect estimates are often of interest in high-stakes applications. Before deploying a model that estimates such effects, one therefore needs to be sure that the best candidate from the ever-growing machine learning toolbox has been chosen for the task. Unfortunately, because counterfactual outcomes are never observed in practice, standard validation metrics usually cannot be used for this purpose, which leads to a well-known model selection dilemma in the treatment effect estimation literature. While some solutions have recently been investigated, a systematic understanding of the strengths and weaknesses of different model selection criteria is still lacking. In this paper, instead of attempting to declare a global 'winner', we therefore empirically investigate the success and failure modes of different selection criteria. We highlight that there is a complex interplay between selection strategies, candidate estimators and the data-generating process (DGP) used for testing, and we provide insights into the relative (dis)advantages of different criteria, alongside desiderata for the design of further illuminating empirical studies in this context.