Personalized treatment effect estimates are often of interest in high-stakes applications. Before deploying a model that estimates such effects, one therefore needs to be sure that the best candidate from the ever-growing machine learning toolbox has been chosen for the task. Unfortunately, because counterfactual information is absent in practice, standard validation metrics usually cannot be relied on for this purpose, which leads to a well-known model selection dilemma in the treatment effect estimation literature. While some solutions have recently been investigated, a systematic understanding of the strengths and weaknesses of different model selection criteria is still lacking. In this paper, instead of attempting to declare a global `winner', we therefore empirically investigate the success and failure modes of different selection criteria. We highlight that there is a complex interplay between selection strategies, candidate estimators, and the data used for comparing them, and we provide insights into the relative (dis)advantages of different criteria, alongside desiderata for the design of further illuminating empirical studies in this context.