A major challenge in data-driven decision-making is accurate policy evaluation, i.e., guaranteeing that a learned decision-making policy achieves the promised benefits. A popular strategy is model-based policy evaluation, which estimates a model from data to infer counterfactual outcomes. This strategy is known to produce unduly optimistic estimates of the true benefit due to the winner's curse. We searched the recent literature on data-driven decision-making, identifying a sample of 55 papers published in Management Science in the past decade; all but two relied on this flawed methodology. Several common justifications are provided: (1) the estimated models are accurate, stable, and well-calibrated, (2) the historical data uses random treatment assignment, (3) the model family is well-specified, and (4) the evaluation methodology uses sample splitting. Unfortunately, we show that no combination of these justifications avoids the winner's curse. First, we provide a theoretical analysis demonstrating that the winner's curse can cause large, spurious reported benefits even when all these justifications hold. Second, we perform a simulation study based on the recent and consequential data-driven refugee matching problem. We construct a synthetic refugee matching environment that is calibrated to closely match the real setting but designed so that no assignment policy can improve expected employment compared to random assignment. Model-based methods report large, stable gains of around 60% even when the true effect is zero; these gains are on par with the improvements of 22-75% reported in the literature. Our results provide strong evidence against model-based evaluation.
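The mechanism behind the winner's curse can be seen in a toy simulation. The sketch below is not the paper's refugee matching environment; the function name, the cell-mean model, and all parameter values are assumptions chosen for illustration. Units are randomly assigned to arms, the outcome has the same success probability in every (group, arm) cell, and a well-specified cell-mean model is fitted. A policy that picks the best-looking arm per group is then evaluated with that same model.

```python
import numpy as np


def simulate_reported_gain(n=2000, n_groups=20, n_arms=10, p=0.3, seed=0):
    """Relative gain a model-based evaluation reports when the true gain is zero.

    All names and defaults are hypothetical, chosen only for illustration.
    """
    rng = np.random.default_rng(seed)

    # Historical data with random treatment assignment.
    group = rng.integers(n_groups, size=n)
    arm = rng.integers(n_arms, size=n)
    # Outcome is independent of group and arm, so no policy beats random assignment.
    y = (rng.random(n) < p).astype(float)

    # Well-specified model: one estimated mean per (group, arm) cell.
    sums = np.zeros((n_groups, n_arms))
    counts = np.zeros((n_groups, n_arms))
    np.add.at(sums, (group, arm), y)
    np.add.at(counts, (group, arm), 1.0)
    # Empty cells (rare with these defaults) fall back to the overall mean.
    est = np.where(counts > 0, sums / np.maximum(counts, 1.0), y.mean())

    # Learned policy: for each group, choose the arm with the highest estimate.
    # Model-based value: the model's own prediction for the chosen arms,
    # averaged over the empirical group distribution.
    group_share = np.bincount(group, minlength=n_groups) / n
    model_value = float((group_share * est.max(axis=1)).sum())

    # Baseline: observed mean outcome under random assignment.
    baseline = float(y.mean())
    return model_value / baseline - 1.0


if __name__ == "__main__":
    gains = [simulate_reported_gain(seed=s) for s in range(20)]
    print(f"mean reported gain: {np.mean(gains):.1%} (true gain is 0 by construction)")
```

Because the policy selects the maximum of several noisy estimates per group, the model's predicted value for the selected arms is biased upward, so the reported gain should come out clearly positive across seeds even though every policy has the same true value. This toy version does not cover sample splitting or the other justifications discussed above.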